TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

TDWI Blog

Big Data Analytics Blog Posts

See the most recent Big Data Analytics related items below.

TDWI Blog: Data 360

To See and Be Seen: Social Media Data and Customer Intelligence

These days have been a whirlwind of projects. One of the biggest for me is the TDWI Best Practices Report I am working on, entitled “Customer Analytics in the Age of Social Media.” This report looks at what organizations are doing and could be doing to analyze information sources to improve their knowledge of and engagement with customers. Social media data is the revolutionary force in this realm; marketing functions are highly focused on how to take advantage of social media both as a new channel and as a critical source of information about customer and market behavior. The heart of this report will be about how customer intelligence and analytics efforts are being reshaped by the influence of social media. This is exciting stuff.

When people talk about “big data,” much of the time they are talking about data generated by human behavior in social networks, blogs, chat rooms, comment fields, and more. Indeed, this can amount to a fast-moving, highly diverse “tsunami” of data that includes both internal (e.g., contact center interactions) and external sources. By discovering insights from this information, organizations can broaden and deepen their understanding of customers and get closer to a 360-degree view.

In addition, organizations can use social media data to gain an early view of the efficacy of marketing campaigns and product introductions. Many organizations are “listening” to such reactions in social media; leading organizations analyze the data rapidly and move quickly to adjust campaigns and engage in the social conversations to improve results.

To be sure, some organizations have serious reservations about social media data. First, not all organizations I have spoken with for the report find social media data trustworthy; some take such analysis with a heavy grain of salt. My research found that while “gut feel” is losing out to the power of data analysis in most marketing functions, there’s still healthy debate about the real value of social media data to marketing decisions.

Second, while organizations at the leading edge of social media get a lot of attention, in a broad sense we are still in the early days. In our research, just 26 percent of participants said that their organizations are currently analyzing social media data; 22 percent are planning to do so within one year, while 21 percent have no plans to do so.

I found that organizations are gaining huge value by drawing insights from social media to get closer to a 360-degree view of customer activity. Data silos are a problem in marketing; each channel often has its own dedicated applications and data. If organizations can correlate what they are seeing in social media with performance data from Web sites and other channels, they can begin to connect the dots across channels.

“Social media for us is not one isolated channel,” a data analyst at a large advertising services firm told me. “We use social media to gain an integrated view of the impact of our marketing across all of our channels, including billboards.” His organization is comparing social media data with their sources on marketing spending, customer transactions by location, and Web site performance. While not complete by itself, social media activity analysis enables a far more current view of marketing campaign performance than organizations have previously had.
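To make the idea of correlating social media activity with another channel concrete, here is a minimal sketch in Python. The daily figures and the names social_mentions and web_visits are invented for illustration; they are not data from the TDWI research, and a real analysis would draw on far more sources than two short series.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Hypothetical daily counts for one campaign week (made-up numbers).
social_mentions = [120, 150, 310, 290, 180, 160, 140]
web_visits = [1000, 1100, 1900, 1850, 1300, 1150, 1050]

r = pearson(social_mentions, web_visits)
print(f"correlation between mentions and visits: {r:.2f}")
```

A high coefficient only says the two channels move together; it does not show that social activity caused the Web traffic, which is one reason to treat social media as part of an integrated view rather than a standalone signal.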

“To see and be seen” is the credo of social media engagement. It isn’t enough to just listen; organizations have to be prepared to act. To do so intelligently, however, organizations must use social media data as not just a single source but as part of their integrated view of customer information.

Posted by David Stodder

Big Data Analytics: 2012 New Year's Predictions

By Philip Russom

Before January runs out, I thought I should tender a few prognostications for 2012. Sorry to be so late with this, but I have a demanding day job. Without further ado, here are a few trends, practices, and changes I feel we can expect in 2012.

Big data will get bigger. But, then, you knew that. Enough said.

The connection between big data and advanced analytics will get even stronger. My base assumption is that advanced analytics has become such an important priority for user organizations that it’s influencing most of what we do in business intelligence (BI), data warehousing (DW), and data management (DM). It even influences our attitudes toward big data. After all, the current frenzy – which will become more operationalized than ad hoc in 2012 – is to apply advanced analytic techniques to big data. In other words, don’t do one without the other, if you’re a BI professional.

From problem to opportunity. The survey for my recent TDWI report on Big Data Analytics shows that 70% of organizations already think of big data as an asset to be leveraged, largely through advanced analytics. In 2012, the other 30% will come around.

From hoarding to collecting. As a devotee of irony, I’m amused to see reality TV shows about collectibles and hoarding run back-to-back. Practices lauded in the former are abhorred in the latter, yet the line between collecting and hoarding is a thin one. Big data is a case in point. Many organizations have hoarded Web logs, RFID streams, and other big data sets for years. The same organizations are now turning the corner into collecting these with a dedicated purpose, namely analytics.

Advanced analytics will become as commonplace as OLAP. Okay, I admit that I’m exaggerating for dramatic effect. But, I have to say that big data alone has driven many organizations beyond OLAP into advanced forms of analytics, namely those based on mining, statistics, complex SQL, and natural language processing. This trend has been running for almost five years; there may be another five in it.

God is in the details. Or is the devil in the details? I guess it depends on what we’re talking about. With big data analytics, expect to see far more granular detail than ever before. For example, most 360-degree customer views today include hundreds of customer attributes. Big data can bump that up to thousands of attributes, which in turn provides greater detail and precision for customer-base segmentation and other customer analytics, both old and new.

Multi-structured data. Are you as sick of the “structured data versus unstructured data” comparison as I am? This tired construct doesn’t really work with big data, because it’s often a mix of structured, semi-structured, and unstructured data, plus gradations among these. I like the term “multi-structured data” (which I admit that I picked up from Teradata folks) because the term covers the whole range and it reminds us that big data is often a kind of mashup. To get full business value out of big data through analytics, more user organizations will invest in people skills and tools that span the full range of multi-structured data.

You will change your data warehouse architecture. At least, you will if you’re truly satisfying the requirements of big data analytics. Let’s be honest. Most EDWs are designed and optimized by their technical users for reporting, performance management, OLAP, and not much else. This is both a user design issue and a vendor platform issue. In recent years, I’ve seen tons of organizations rearchitect their EDWs (and sometimes swap platforms) to accommodate massive big data, multi-structured data, real-time big streams, and the demanding workloads of advanced analytics. This painful-but-necessary trend is far from over.

I’m stopping here because I’ve reached my target word count. And my growling stomach says it’s lunch time. But you get the idea. The business value of advanced analytics and the nuggets to be mined from big data have driven a lot of change recently, and will continue to do so throughout 2012.

SUGGESTED READING:
For a detailed discussion, see the TDWI Best Practices Report, titled Big Data Analytics, which is available in a PDF file via a free download.

You can also replay my TDWI Webinar, where I present the findings of the Big Data Analytics report.

For a discussion of similar issues, download the TDWI Checklist Report, titled Hadoop: Revealing Its True Value for Business Intelligence.

And you can replay last month’s TDWI Webinar, in which I led a panel of vendor representatives in a discussion of Hadoop and related technologies.

Philip Russom is the research director for data management at TDWI. You can follow him as @prussom on Twitter.

Posted by Philip Russom, Ph.D.

Big Data Analytics: An Overview in 20 Tweets

By Philip Russom, TDWI

To raise awareness of the new tool features, user techniques, and team structures of Big Data Analytics, I recently issued a series of twenty tweets via Twitter, over a two-week period. The tweets also helped promote a TDWI Webinar on Big Data Analytics. Most of these tweets triggered responses to me or retweets. So I seem to have reached the business intelligence (BI) and data warehouse (DW) audience I was looking for – or at least touched a nerve!

To help you better understand Big Data Analytics and why you should care about it, I’d like to share some of the thoughts from these tweets with you. I think you’ll find them interesting because they provide an overview of Big Data Analytics in a form that’s compact, yet amazingly comprehensive.

Every tweet I wrote was a short sound bite or stat bite drawn from TDWI’s recent report on Big Data Analytics, which I researched and wrote. Many of the tweets focus on a statistic cited in the report, while other tweets are definitions stated in the report.

I left in the arcane acronyms, abbreviations, and incomplete sentences typical of tweets, because I think that all of you already know them or can figure them out. Even so, I deleted a few tiny URLs and repetitive phrases. I issued the tweets in groups, on related topics; so I’ve added some headings to this blog to show that organization. Otherwise, these are raw tweets.

Defining Big Data, Advanced Analytics, and Big Data Analytics
1. #BigData #Analytics = where advanced analytics operate on big data sets. So, it’s about 2 things. Learn more in Webinar http://bit.ly/qp4wp6
2. Advanced #Analytics = data mining, statistics, extreme SQL, data viz, artificial intell, language processing.
3. Advanced #Analytics = database techs like MapReduce, in-database & in-memory analytics, column stores.
4. Advanced #Analytics = discovering unknown biz facts. Instead of advanced, should call it discovery analytics
5. #BigData = not just multi-terabyte datasets. Also about diverse data types & real-time or streaming data.
6. Bleeding edge of #BigData = data streaming from sensors, robotics, monitor devices, Web logs.

Benefits and Barriers for Big Data Analytics
7. #TDWI SURVEY SEZ: #BigData #Analytics benefits customer relations, BI, most pre-existing analytic apps.
8. #TDWI SURVEY SEZ: Bad skills, sponsors, & database software are leading barriers to #BigData #Analytics.

Organizational Issues and Big Data Analytics
9. #TDWI SURVEY SEZ: 30% consider #BigData a data mgt problem. 70% think it a biz opp when analyzed. Attend #TDWI Webinar http://bit.ly/qp4wp6
10. #TDWI SURVEY SEZ: #BigData #Analytics is owned by BI/DW team (41%), dep’ts (21%), IT/CIO (12%).
11. #TDWI SURVEY SEZ: Business analyst is most common job title for designer of #BigData #Analytics.

The State of Big Data Analytics
12. #TDWI SURVEY SEZ: 74% of orgs have some form of analytics today. But only 34% do #BigData #Analytics.
13. #TDWI SURVEY SEZ: 37% of orgs have 10Tb+ of #BigData just for #Analytics. More on #TDWI Webinar http://bit.ly/qp4wp6
14. #TDWI SURVEY SEZ: 20% of orgs expect to have 500Tb+ of #BigData just for #Analytics by 2013.
15. #TDWI SURVEY SEZ: 64% of orgs today manage #BigData for #Analytics in EDW, 38% outside EDW.
16. #TDWI SURVEY SEZ: 24% claim to have Hadoop today. #TDWI suspects most are experimental downloads. But still impressive
17. #TDWI SURVEY SEZ: #BigData is struc 92%, semi-struc 54%, hier 54%, events 45%, unstruc 35%, social 34%, Web 31%...

Future Trends in Big Data Analytics
18. #TDWI SURVEY SEZ: 33% will replace #Analytics platform within 3 yrs. Another 11% after that. 9% already replaced.
19. #TDWI SURVEY SEZ: Why replace #Analytics platform? Poor scale, loading, query speed, real time, SOA, self service, viz.
20. #TDWI SURVEY SEZ: #BigData #Analytics techs set to grow most: advanced analytics, data viz, in-memory DBs, unstruc data

FOR FURTHER STUDY:
Don’t miss my next TDWI Webinar on Hadoop. I’ll lead a panel of vendor representatives in a discussion of Hadoop and its value for BI, DW, and analytics. Register online, so you can join us December 14, 2011 at noon ET.

For a more detailed discussion of Big Data Analytics – in a traditional publication! – see the TDWI Best Practices Report, titled Big Data Analytics, which is available in a PDF file via a free download.

You can also register for and replay my TDWI Webinar, where I present the findings of the Big Data Analytics report.

Philip Russom is the research director for data management at TDWI. You can follow him as @prussom on Twitter.

Posted by Philip Russom, Ph.D.

Big Data Analytics: The News from Informatica

Blog by Philip Russom
Research Director for Data Management, TDWI

Early this morning, Informatica Corporation announced Informatica HParser, a new product for parsing data in Apache Hadoop environments. Instead of repeating the details of the announcement – which you can read on www.informatica.com, etc. – I’d rather use the announcement as a springboard for my own thoughts about the bigger trends and issues in Big Data Analytics and Hadoop that the announcement fits into. The catch is that there are so many myths and misconceptions (i.e., “mythconceptions”) about Hadoop right now, that I can’t bust them all in a short piece like this blog. So I’ll just present the two leading mythconceptions as background, plus a brief rant for color.

First Mythconception. Hadoop is not one, monolithic thing, so we need to stop talking about it that way. It’s actually an open source software library administered by the Apache Software Foundation. (Some Hadoop products are also available via vendor distributions; but that’s another story.) The Apache Hadoop library includes several products and technologies, including (in BI priority order) the Hadoop Distributed File System (HDFS), MapReduce, Hive, HBase, Pig, ZooKeeper, Flume, Sqoop, Oozie, Hue, and so on. It’s up to you to figure out which combination of Apache Hadoop products to implement for a given application. For applications in business intelligence (BI) and Big Data Analytics, HDFS and MapReduce (perhaps with HBase and Hive) constitute a useful technology stack.

Second Mythconception. Theoretically, HDFS can manage the storage and access of any data type, as long as you can put the data in a file and copy that file into HDFS. As outrageously simplistic as that sounds, it’s largely true, and it’s exactly what brings many users to Apache HDFS in the first place. Yet, HDFS’s admirable tolerance for diverse data doesn’t mean that an Apache Hadoop environment operates equally well with all file and data types. According to users I’ve interviewed, if you expect to get speed, scalability, and development simplicity, you need to work with Hadoop’s preference for record-based data. That’s not as limiting as it sounds, because many types of Big Data handled by HDFS are inherently record-based, as in logs from Web servers and sensors or table dumps of call detail records, customer records, transactions, etc. Furthermore, many sources of traditional enterprise data can be converted to records and copied to HDFS for Big Data Analytics and other applications.
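To illustrate what “inherently record-based” means for a Web server log, here is a minimal Python sketch that turns raw log lines into records with named fields. It assumes lines in the widely used Common Log Format; the sample lines, the pattern name LOG_PATTERN, and the function parse_line are made up for this example and have nothing to do with Hadoop’s own APIs.

```python
import re

# Assumed layout: Common Log Format, e.g.
# host ident user [timestamp] "METHOD path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return one record (a dict) per log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# Hypothetical sample lines.
sample_lines = [
    '10.0.0.1 - - [14/Dec/2011:12:00:01 -0500] "GET /index.html HTTP/1.1" 200 5120',
    '10.0.0.2 - - [14/Dec/2011:12:00:02 -0500] "GET /missing HTTP/1.1" 404 -',
    'not a log line',
]

records = [rec for rec in map(parse_line, sample_lines) if rec is not None]
for rec in records:
    print(rec["host"], rec["path"], rec["status"], rec["size"])
```

Each line maps to exactly one record with the same fields, which is the property that makes this kind of data easy to split across many workers.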

And that brings us to Informatica Corporation’s announcement today of the new Informatica HParser. In a Hadoop environment, it’s MapReduce that actually executes the programmatic logic of an application. In the context of Big Data Analytics, the logic is (today) usually hand-coded data transformations or analytic logic. HParser provides an integrated development environment (IDE) for creating data transformation logic, plus ties into MapReduce to ensure that the logic executes in a fully distributed and parallel fashion. Given Apache Hadoop’s preference for record-based data, use cases cited by Informatica focus on how HParser can convert unstructured data into records and tables, plus flatten overly structured or “complex” data (as in the hierarchies of XML and JSON) into records that are more palatable to HDFS and Apache MapReduce. Record structures aside, Informatica HParser also supports a long list of data standards and document types. And Informatica PowerExchange for Hadoop provides additional functionality.
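For readers who want to picture what “flattening” hierarchical data into records involves, here is a minimal Python sketch. It is not HParser’s API or output format, which I am not reproducing here; the document shape and the function names flatten and to_records are assumptions made purely for illustration.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def to_records(order):
    """Emit one flat record per line item, repeating the parent fields."""
    header = flatten({k: v for k, v in order.items() if k != "items"})
    return [{**header, **flatten(item, "item")} for item in order["items"]]


# Hypothetical hierarchical document.
doc = json.loads("""
{"order_id": 17,
 "customer": {"id": "C42", "region": "west"},
 "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}
""")

records = to_records(doc)
for record in records:
    print(record)
```

One nested order becomes two flat records, each carrying the order and customer fields alongside a single line item, which is the tabular shape that record-oriented processing prefers.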

A brief rant. If you’ve been reading my writings on data integration for the last ten years, you know that I consider hand-coded data integration to be non-productive. Hand coding is time-consuming, not very re-usable, hard to update, and inherently feature-poor compared to vendor platforms. Now, we’re faced with Apache MapReduce, which – out of the box – demands huge amounts of hand coding, because it’s a processing engine that manages and provides parallelization for hand-coded routines (whether for analytics, DI, or otherwise). Informatica HParser shows promise for reducing the non-productive hand-coding that open-source environments like Hadoop, MapReduce, and Hive assume.
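To show the kind of routine a developer hand-codes for MapReduce, here is a toy, single-process sketch of the map, shuffle, and reduce steps in Python. It is not the Hadoop API and it does no distribution or parallelization; the function names and the sample records are invented for illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (key, value) pairs for one input record."""
    yield record["path"], 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical page-view records.
page_views = [{"path": "/home"}, {"path": "/cart"}, {"path": "/home"}]
hits = run_job(page_views)
print(hits)
```

In a real cluster the framework supplies the shuffle and runs many copies of the map and reduce functions in parallel; the developer still writes the two functions by hand, which is the productivity concern raised above.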

Conclusion. I feel that the men and women who’ve contributed to open source Hadoop have made an impressive and innovative contribution. And the Apache Software Foundation does a great job enabling the open source community. Thanks to these contributions, Hadoop is successfully used in production, but mostly in large, Internet-based businesses, like Amazon, Comscore, eBay, Google, and LinkedIn. However, for the Hadoop family – and the Big Data Analytics it enables – to become truly useful in a wide range of mainstream organizations across multiple industries, I think that the Hadoop family needs a number of new extensions, improvements, and options for interoperability.

This is why we’re now seeing software vendor companies coming out with various types of support for Apache Hadoop products and technologies. Informatica’s HParser and Informatica PowerExchange for Hadoop are prime examples, and other DI vendors will soon follow suit with similar interfaces and extensions for Hadoop. Some vendors are building administrative tools, which HDFS sorely lacks. And BI and analytic tool vendors are scrambling to sit atop HDFS and MapReduce. Personally, I hope to see more support for Hadoop and soon, because, without it, mainstream user organizations can’t get full value from Hadoop. Hence, they may not adopt it.

So, what do you think? Let me know!

===============================
Do you suffer mythconceptions about Hadoop? If so, TDWI can help you bust them:
• TDWI will soon publish my new Checklist Report on Hadoop, available as a free download on tdwi.org, starting Dec. 13, 2011.
• On Dec. 14, 2011, I’ll broadcast a TDWI Webinar based on that report. Please register online for the Hadoop Webinar.

Posted by Philip Russom, Ph.D.

Big Data Analytics: The News from Teradata

Blog by Philip Russom
Research Director for Data Management, TDWI

Just moments ago, Teradata Corporation issued three announcements describing new capabilities, products, and releases. Instead of repeating the details of Teradata’s new stuff -- which you can read on www.teradata.com, etc. -- I’d rather be self-indulgent and use each announcement as a springboard for my own thoughts about the bigger trends in Big Data Analytics these relate to.

Announcement Number One: Teradata Columnar

A few years ago, I was at the Teradata Partners Conference. Instead of attending speaking sessions, I was in a series of meetings for industry analysts and industry influencers. When the topic of columnar databases came up -- and it was my turn to pontificate -- I said something like: “Columnar storage engines will soon be available as just another feature of database management systems from larger, more established vendors.” The room fell quiet, and a cricket chirped in the background. Then, two experts mocked me, while Teradata people were noticeably mum. ;)

Does that make me a prescient visionary? No, not at all. I’ve just been paying attention for the last three decades, as one technology after the next is developed and proved by a small startup, then bought or built by one or more of the leading DBMS vendors. We’ve seen this trend played out with features for everything from security to parallel processing to OLAP to federation to in-memory databases. We’re now seeing the same trend with columnar data stores and other technologies for Big Data Analytics.

Newish vendors like ParAccel and Vertica -- and Sybase long before them -- have proved the usefulness and commercial potential of a columnar approach. Open source DBMSs MySQL and Infobright made similar contributions. In full compliance with the trend I’m describing, IBM and Oracle have released columnar storage engines they built, and now it’s Teradata’s turn. Teradata Columnar is a new capability of Teradata Database 14. What’s new here is that Teradata has integrated both columnar AND row-based tables, thereby making hybrid applications more feasible. All the above is goodness, regardless of vendor, because columnar data stores have compelling advantages for query speed, data compression, bla, bla, bla, and the usual miraculous benefits.
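For anyone who wants the query speed and compression argument in miniature, here is a Python sketch contrasting a row layout with a column layout. It says nothing about how Teradata Columnar or any other vendor’s engine is built internally; the sample table and the function names to_columns and run_length_encode are assumptions for illustration.

```python
from itertools import groupby

# Hypothetical row-based table: each row holds every attribute.
rows = [
    {"region": "west", "product": "A", "revenue": 100},
    {"region": "west", "product": "B", "revenue": 250},
    {"region": "west", "product": "A", "revenue": 75},
    {"region": "east", "product": "A", "revenue": 300},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Compress runs of repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# An aggregate over one attribute reads a single column, not every row.
total_revenue = sum(columns["revenue"])

# Values of one attribute sit together, so repeats compress well.
encoded_region = run_length_encode(columns["region"])

print(total_revenue)
print(encoded_region)
```

The query touches one list instead of four full rows, and the region column shrinks to two pairs; both effects grow with table size, which is the intuition behind the usual columnar claims.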

This recurring trend raises the question: What’s the next innovation that’s on the path to DBMS assimilation? It’s obvious to me that Hadoop and MapReduce are already well down that path. And that brings us to the next Teradata announcement.

Announcement Number Two: Teradata Aster MapReduce Platform

On the upside, MapReduce is the secret sauce that brings advanced analytic capability to a big data repository, whether it’s Hadoop’s file system or a relational database management system (RDBMS). On the downside, MapReduce from most sources is mired in hand-coding and devoid of SQL (to which we’re hand-cuffed in BI). Hence, MapReduce shows great promise for the world of BI, but only if it can evolve to suit the technical requirements of BI and DW professionals.

Evolving MapReduce is what the small vendor Aster Data Systems has always been about, and the evolution continues now that Teradata has acquired Aster. First, Aster showed that MapReduce could be effective with an RDBMS – at least, with its own nCluster database, now called Aster Database 5.0. Aster then showed that MapReduce and SQL can be reconciled, and they received a patent for their innovation in this realm.

Let’s shift gears and look at data warehouse appliances. Despite the term “data warehouse” in the name, these are really “big data analytics appliances.” I say this based on the fact that at least 90% of DW appliance owners use them for multi-terabyte analytics, not data warehousing. Aster is now showing that a MapReduce-based RDBMS can be suited to an appliance, as in the new Aster MapReduce Appliance based on Teradata hardware.

I’ll say more about the evolution of MapReduce in a TDWI Webinar on October 27. Please register online and attend.

Announcement Number Three: Teradata Database 14

Most of the new functionality of Teradata Database 14 seems focused on making the system even more manageable and better performing, especially in the context of multiple, diverse, concurrent data warehouse workloads.

The multiple workload problem is a thorny one. From the DW professional’s viewpoint, it’s not easy to optimize a data warehouse for several workloads; so most EDWs are optimized for a short list of workloads. Since the primary deliverables of the average DW are reports (whether standard or dashboards) and OLAP, most EDW designers consciously decide to optimize for these. But that makes it difficult to add new workloads to a centralized enterprise data warehouse, so new workloads are often distributed to marts, operational data stores, and data staging areas outside the warehouse proper. Examples of “new workloads” include those for real time, detailed source data, non-structured data, and discovery or exploratory analytics (not OLAP).

How DW professionals and vendors are responding to the challenge of multiple workloads constitutes a trend. That’s because the responses affect data warehouse architecture, logical modeling, optimization, performance, platform selection, tool selection, selection of analytic methods, management strategies for big data, and so on.

Note that the multiple workload challenge is both a user design issue and a vendor platform capability issue. Yet, I think the former can win out over the latter. A good design on a weak platform can succeed, though you’ll probably end up with a heavily distributed DW architecture. Conversely, a bad design on a strong platform can fail, especially if you expect the platform to be the design. Technology and design issues aside, I must also point out that the placement of a DW workload can be influenced by organizational issues, like sponsorship, funding, and compliance.

So, what do you think? Let me know!

===============================
Want to learn more about Big Data Analytics? Attend the TDWI Forum on Big Data Analytics for Business Insight. There's more information online.

Posted by Philip Russom, Ph.D.

It's All in the Memory: New Battleground for BI and Analytics

Where is the biggest battleground today in the business intelligence and analytics software market? On the technology front, one of the main battles is in the addressable memory space of systems that feature 64-bit computing and operating system platforms. The “in-memory” revolution is upon us, and no BI or analytics vendor wants to be left out. Large memory platforms will be critical to users working with tools for big data analytics, data discovery, data visualization, and more.

While the development of large-memory computing is not really new, it took a while for the software industry to adapt to 64-bit hardware processing and operating system platforms. Throw in the difficult learning curve for creating software to work with parallel processing, and it’s easy to see why the move from older systems has taken time. When large memory and parallel processing platforms were exotic, the slow pace of adaptation might have been acceptable. Now, with mainstream systems offering up to a terabyte of addressable memory, organizations can’t wait to try them out for BI and analytics.

Traditionally, designers of these systems have had to adjust to the limits of the I/O bottleneck. The preprocessing and design work for indexing and aggregating data has been necessary because of the performance constraints involved in getting data from disk through the I/O bottleneck. If large memory systems can ease or eliminate that constraint for the majority of users’ analysis needs, then the boundaries for analytics applications can be pushed out.

Users can perform “data discovery,” asking questions that lead to more questions, without as much concern for what this iterative, ad hoc style of investigation might mean to overall performance. Unlike with BI reports that simply update standard views of data, users can engage in exploratory data inquiries without knowing exactly where they will end up. Large-memory systems can offer volumes of detailed data on systems deployed closer to users. With the right tools, line-of-business (LOB) decision makers can dive into the data to test predictive models and perform fine-grained analysis on their own rather than wait for IT’s specialized business analysts and statisticians to do it for them.

Data discovery vendors such as QlikTech, Tableau, and TIBCO Spotfire have prospered by jumping first to seize market opportunities. However, the biggest coming battle may be between SAP and Oracle. Earlier this year, SAP introduced HANA, which competes with Oracle’s Exadata by offering in-memory analytics along with traditional disk-based storage in an appliance. Oracle has been readying a response, which will most likely come at Oracle OpenWorld in early October and be aimed at taking in-memory capabilities for BI and analytics further. In the coming year, Oracle and SAP will battle to show which vendor is better at using analytics to increase the business value of ERP investments. In-memory capabilities will make it easier for these and other vendors to deploy rich analytics for ERP that are tailored to vertical industry and LOB requirements.

Large memory is not the whole story when it comes to the future of BI and analytics. However, it is a technology trend that users will notice firsthand through deeper, more visual, and more timely data analysis.

Posted by David Stodder

Advanced Analytics versus Online Analytic Processing (OLAP)

Blog by Philip Russom
Research Director for Data Management, TDWI

The current hype and hubbub around big data analytics has shifted our focus to what’s usually called “advanced analytics.” That’s an umbrella term for analytic techniques and tool types based on data mining, statistical analysis, or complex SQL – sometimes natural language processing and artificial intelligence, as well.

The term has been around since the late 1990s, so you’d think I’d get used to it. But I have to admit that the term “advanced analytics” rubs me the wrong way for two reasons:

First, it’s not a good description of what users are doing or what the technology does. Instead of “advanced analytics,” a better term would be “discovery analytics,” because that’s what users are doing. Or we could call it “exploratory analytics.” In other words, the user is typically a business analyst who is exploring data broadly to discover new business facts that no one in the enterprise knew before. These facts can then be turned into an analytic model or some equivalent for tracking over time.

Second, the thing that chafes me most is that the way the term “advanced analytics” has been applied for fifteen years excludes online analytic processing (OLAP). Huh!? Does that mean that OLAP is “primitive analytics”? Is OLAP somehow incapable of being advanced?

I personally don’t think so. In fact, depending on how you design and implement it, OLAP can be quite advanced. For example, OLAP is very much about dimensions. In the 90s, eight dimensions was considered an advanced implementation. Nowadays I regularly talk with people who have twenty or more. I realize there’s a difference between advanced and mature. But I have to say that I’ve seen lots of mature OLAP implementations that support hundreds of cubes, hundreds of OLAP reports, and thousands of users. Over the years, different approaches to OLAP (multidimensional, relational, desktop, etc.) have consolidated into a hybrid OLAP, such that most vendor products today are quite mature, feature rich, and flexible.

Here’s another, related issue. While researching a new TDWI report on big data analytics, I ran across a few people (users, consultants, and vendors) who think that “advanced analytics” (or whatever you want to call it) will render OLAP obsolete. Therefore, user organizations should expunge OLAP from their BI portfolios. Uh, no. I don’t see that happening.

In defense of OLAP, it’s by far the most common form of analytics in BI today, and for good reasons. Once you get used to multidimensional thinking, OLAP is very natural, because most business questions are themselves multidimensional. For example, “What are western region sales revenues in Q4 2010?” intersects dimensions for geography, function, money, and time. Discoveries made in OLAP are easily “institutionalized” or “operationalized” (much more so than advanced analytics), so OLAP analyses are repeated over time with consistency. Since dimensions are easily expressed as parameters, an OLAP-based report can be as easy to use as a parameterized report, thereby putting OLAP-based analytics within the comprehension of a vast range of possible end-users.
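To make the “dimensions as parameters” point concrete, here is a minimal Python sketch of a tiny pre-aggregated cube queried like a parameterized report. The fact rows, the choice of region and quarter as the cube’s dimensions, and the function names build_cube and sales_revenue are all assumptions for illustration, not any vendor’s OLAP interface.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, quarter, product, revenue).
facts = [
    ("west", "2010-Q4", "widgets", 1200.0),
    ("west", "2010-Q4", "gadgets", 800.0),
    ("west", "2010-Q3", "widgets", 950.0),
    ("east", "2010-Q4", "widgets", 1100.0),
]


def build_cube(facts):
    """Pre-aggregate revenue by the (region, quarter) dimensions."""
    cube = defaultdict(float)
    for region, quarter, _product, revenue in facts:
        cube[(region, quarter)] += revenue
    return cube


def sales_revenue(cube, region, quarter):
    """Answer the business question by supplying dimension values as parameters."""
    return cube.get((region, quarter), 0.0)


cube = build_cube(facts)
print(sales_revenue(cube, "west", "2010-Q4"))
```

The same function answers the question for any region and quarter the cube covers, which is why an OLAP-based report can feel as simple as filling in two fields; a question about a dimension the cube omits cannot be answered from it.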

The scope of discovery of an analytic method seems to be an important concern right now, as seen in the current fascination with big data analytics. In that context, a possible limitation of OLAP is that most implementations are tightly coupled to datasets called cubes. If the information someone hopes to discover is not in a cube, then that can be a problem. Even so, so-called relational OLAP can be a solution, and OLAP tools are so friendly nowadays that just about anyone can create a cube. Depending on how an OLAP implementation is designed and which vendor tools are used, a cube can limit the scope of discovery, just as any analytic dataset can – even if it’s multi-terabyte big data.

In my mind, advanced analytics is very much about open-ended exploration and discovery in large volumes of fairly raw source data. But OLAP is about a more controlled discovery of combinations of carefully prepared dimensional datasets. The way I see it: a cube is a closed system that enables combinatorial analytics. Given the richness of cubes users are designing nowadays, there’s a gargantuan number of combinations for a wide range of users to explore.

So, OLAP’s not going away. Users would be nuts to abandon their large investments in such a handy technology. And it’s like most situations in IT. Few things go away. Organizations just keep adding more tool types and best practices to their portfolios. Therefore, user organizations should expect to maintain their useful investments in OLAP, while also digging deeper into other forms of exploratory and discovery analytics.

So, what do you think, folks? Let me know. Thanks!

Posted by Philip Russom, Ph.D.

Big Data Analytics: Avoid the Analytic Cul-De-Sac

Blog by Philip Russom
Research Director for Data Management, TDWI

Do you know what a cul-de-sac is? In French, it literally means “bottom of the bag.” But figuratively it means what most Americans would call a “dead-end street.” In residential real estate, a cul-de-sac is a desirable place to live. In analytics, a cul-de-sac is where the epiphanies of advanced analytics never get off a dead-end street to be fully leveraged elsewhere in the enterprise.

The current hype around big data analytics has most discussions of analytics focused on “discovery” analytics. That’s where a business analyst or similar user employs an advanced analytics tool (based on data mining, statistics, natural language processing, complex SQL, etc.) to discover facts never known before. For example, the analyst may discover the root cause for a new form of customer churn, a new partner behavior that’s potentially fraudulent, or the hidden costs that erode otherwise profitable customers.

While researching a new TDWI report on big data analytics, I’ve run across a number of business analysts who revel in the chase around the cul-de-sac, but can’t be bothered with operationalizing their epiphanies. “That’s someone else’s job,” one guy told me. Here’s what I mean.

Too often analysts drive through a figurative big data “bottom of the bag,” until just the right dataset yields an epiphany. Then they share their findings with managers and move on to the next analytic project.

This is an analytic cul-de-sac, when the analyst does not also take the findings off the dead-end street and “operationalize” them. In other words, once you discover the new form of churn, analytic models, metrics, reports, warehouse data, and so on need to be updated, so the appropriate managers can easily spot the churn and do something about it quickly, if it returns. Likewise, hidden costs, once revealed, should be operationalized in analytics (and possibly reports and warehouses), so managers can better track and study costs over time, to keep them down.

I think that most analysts and similar users are avoiding analytic cul-de-sacs, by being sure that discovered epiphanies are operationalized by someone (whether by the actual analyst or another team member). I’m just saying that the product of analytics isn’t necessarily being leveraged to the hilt in every organization.

To avoid analytic cul-de-sacs and similar squanderings of insight, you might want to review some of the processes around your use of advanced analytics. In particular, be sure the process extends beyond discovery into operationalizing the epiphanies of analytics.

So, what do you think, folks? Let me know. Thanks!

Posted by Philip Russom, Ph.D.

Big Data Analytics: Preparing Analytic Data Differs from ETL for Data Warehousing

Blog by Philip Russom
Research Director for Data Management, TDWI

While researching a new TDWI report on big data analytics, I’ve run across a few BI professionals who are concerned about the seeming lack of data preparation that’s common with some forms of advanced analytics. Allow me a moment to sort this out.

On the one hand, all of us in BI and data warehousing are indoctrinated to believe that the data of an enterprise data warehouse (EDW) (and hence the data that feeds into reports) must be absolutely pristine, integrated and aggregated properly, well-documented, and modeled for optimization. To achieve these data requirements, BI teams work hard on extract, transform, and load (ETL), data quality (DQ), meta and master data management (MDM), and data modeling. These data preparation best practices make perfect sense for the vast majority of the reports, dashboards, and OLAP-based analyses that are refreshed from data warehouse data. For those products of BI, we want to use only well-understood data that’s brought as close to perfection as possible. And many of these become public documents, where problems with data could be dire for a business.

On the other hand, preparing data for advanced analytics requires very different best practices – especially when big data is involved. The product of advanced analytics is insight, typically an insight about bottom-line costs or customer churn or fraud or risk. These kinds of insights are never made public, and the analytic data they’re typically based on doesn’t have the reuse and publication requirements that data warehouse data has. Therefore, big data for advanced analytics rarely needs the full brace of ETL, data quality, metadata, and modeling we associate with data from an EDW.

In fact, if you bring to bear the full arsenal of data prep practices on analytic datasets, you run the risk of reducing their analytic value. This is ironic, because we usually think of ETL, DQ, and data modeling as adding value to data, not subtracting it. So, how can they harm analytic data?

To answer that question, let’s first take a look at so-called “advanced analytics.” This collection of analytic techniques would be better called “discovery analytics,” because that’s what users do with it. A business analyst or similar user applies techniques like data mining, statistical analysis, complex SQL, MapReduce, and natural language processing to discover facts about the business that no one knew before. For example, you might discover the root cause of the latest form of customer churn. Or you might find a cluster of transactions that indicate a new kind of fraud. Or you could stumble onto an untapped customer segment.

In general, you can’t discover those entities and facts from the overly studied, calculated, modeled, and aggregated data of an EDW. Instead, you need big data, with lots of granular detail, typically in the schema of the source systems it came from. Some forms of analytics actually thrive on questionable data in poor condition. For example, analytic applications for fraud detection may depend on outliers and non-standard data as indications of fraud. And the insights of discovery analytics often focus on narrow slices of the business, like an obscure customer segment, or time frame or group of shipments or transaction types or risky neighborhood. These thin slices can easily disappear in an aggregation pass. Hence, if you apply ETL and DQ processes to big data, as you do for a data warehouse, you run the risk of stripping out the very nuggets that make big data a treasure trove for discovery-oriented advanced analytics. This is why the preparation of data for discovery analytics seems minimal (even slipshod) – often just extracts and table joins – compared to the full range of data prep applied to EDW data.
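Here is a small illustration of how a thin slice can vanish in an aggregation pass. It is only a sketch with made-up numbers; the `TRANSACTIONS` data, the store names, and the `factor` threshold are assumptions chosen to make the effect obvious, not a real fraud-detection method.

```python
from statistics import mean, median

# Hypothetical granular transactions as (store, amount) pairs.
TRANSACTIONS = [
    ("north", 50.0), ("north", 52.0), ("north", 48.0), ("north", 50.0),
    ("south", 5.0), ("south", 5.0), ("south", 5.0), ("south", 185.0),
]

def aggregate_by_store(rows):
    """Warehouse-style rollup: one average amount per store."""
    amounts_by_store = {}
    for store, amount in rows:
        amounts_by_store.setdefault(store, []).append(amount)
    return {store: mean(amounts) for store, amounts in amounts_by_store.items()}

def suspicious(rows, factor=3.0):
    """Flag detail rows far above the overall median (threshold is an assumption)."""
    typical = median(amount for _, amount in rows)
    return [(store, amount) for store, amount in rows if amount > factor * typical]

print(aggregate_by_store(TRANSACTIONS))  # {'north': 50.0, 'south': 50.0}
print(suspicious(TRANSACTIONS))          # [('south', 185.0)]
```

After the rollup, both stores look identical, and the one odd transaction that an analyst would want to chase is no longer visible. It survives only in the granular detail.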

Does this mean that we can throw out the sacrosanct best practices for ETL, DQ, metadata, MDM, and data modeling? No, of course not. Some organizations will simply need to suspend these for discovery analytics with big data – but only temporarily. Here’s a typical scenario.

After business analysts and other users have discovered what they’re looking for in big data, they need to take the discovery to the BI and DW team, so the results can be “institutionalized” in the EDW. For example, when discovery analytics reveals valuable items – like new forms of churn, customer segments, cost centers, etc. – these need to be represented by data structures in the EDW and reports, so that business people can track them regularly. At that point, the best practices of data preparation come back into play.

So, what do you think, folks? Let me know. Thanks!

Posted by Philip Russom, Ph.D.

Big Data Analytics: The View from SAP

Blog by Philip Russom
Research Director for Data Management, TDWI

A few weeks ago, I talked with Mike Eacrett, the vice president of product management for SAP HANA at SAP Labs. Among other things, Mike explained the “secret sauce” that gives SAP HANA flexibility and performance for big data analytics. Give me a moment to recount Mike’s explanation.

Philip Russom: What forms of analytics are you seeing on the rise with SAP customers?

Mike Eacrett: SAP customers continue to expand their investments in online analytic processing (OLAP). But the explosive growth is with exploratory analytics. That’s where a business user needs to learn things that he/she didn’t know to ask before. Or they need to see patterns or the absence of them in the data, typically in response to a change in the business or customer behavior. This kind of exploration requires big data, typically in its original source schema with all its details intact. Instead of transforming and cleansing the data prior to analysis (which can lose desirable data details), the user iteratively develops queries that manipulate data at the analytic tool level, not the physical storage level, as you would when, say, modeling a data warehouse.

Philip Russom: I’m familiar with this analytic method, so I know that it requires a hefty platform for big data analytics. What is SAP offering in this regard?

Mike Eacrett: We offer the SAP In-Memory Computing Appliance, otherwise known as SAP HANA. It’s an enterprise software architecture that enables analytic queries to run against detailed source data—and run fast in real time—without need for transforming the data into data models optimized for a specific type of analysis. To achieve this, SAP HANA implements its own massively parallel distributed processing method (similar to some of the concepts of MapReduce), based on HANA’s in-memory database, running code that utilizes the instruction set and vector processing capabilities of Intel chip sets. That means that the SAP user needn’t define analytic queries months in advance, then wait for IT to model data for them. All the data is available at their fingertips in memory. HANA gives logical data modeling a new twist, so that the analyst user can run queries as fast as he or she thinks them up, and without being limited by data models, data movement, and pre-aggregation constraints.

Philip Russom: You mentioned that SAP HANA gives logical data modeling a new twist. What do you mean?

Mike Eacrett: The term for this new technique is “logical data marting.” It assumes that all the operational source data needed for analytics present in SAP modules is also available in SAP HANA. A logical data model of a data mart is constructed in server memory, based on an analytic query that’s being executed. In SAP HANA-based applications, the same data model is used for online transactional (OLTP) and analytics – in other words, the data marts are a logical view of one persistence layer. The logical model draws data from modules’ underlying memory persisted tables, as needed by queries. As an analyst or HANA-based application iteratively redefines a query, the model automatically redraws itself, using analytic and calculation views. The logical model (based on queries against the pre-built SAP business content) liberates analysts from cumbersome data modeling, and the in-memory processing gives it true real-time speed.
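For readers who want a rough feel for a mart that is "a logical view of one persistence layer," here is a loose analogy using Python's built-in `sqlite3` module with an in-memory database. This is not SAP HANA and does not use any SAP API; the `orders` table, the `sales_by_region` view, and the `revenue_by_region` helper are hypothetical names invented for this sketch.

```python
import sqlite3

# One in-memory store stands in for the single persistence layer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("West", "Widgets", 120.0),
        ("West", "Gadgets", 80.0),
        ("East", "Widgets", 110.0),
    ],
)

# The "mart" is only a view: no data is copied, moved, or pre-aggregated.
conn.execute(
    """
    CREATE VIEW sales_by_region AS
    SELECT region, SUM(amount) AS revenue
    FROM orders
    GROUP BY region
    """
)

def revenue_by_region(connection):
    """Query the logical mart and return {region: revenue}."""
    rows = connection.execute(
        "SELECT region, revenue FROM sales_by_region ORDER BY region"
    )
    return dict(rows.fetchall())

before = revenue_by_region(conn)
# A new transactional row is visible to the analytic view immediately.
conn.execute("INSERT INTO orders VALUES ('East', 'Gadgets', 40.0)")
after = revenue_by_region(conn)

print(before)  # {'East': 110.0, 'West': 200.0}
print(after)   # {'East': 150.0, 'West': 200.0}
```

The point of the analogy is only that transactional writes and analytic reads share the same tables, so there is no separate load step between them. How HANA actually builds and redraws its models is not shown here.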

So, what do you think, folks? Let me know. Thanks!

Posted by Philip Russom, Ph.D.