TDWI Blog

Philip Russom, Ph.D., is senior director of TDWI Research for data management and is a well-known figure in data warehousing, integration, and quality, having published over 550 research reports, magazine articles, opinion columns, and speeches over a 20-year period. Before joining TDWI in 2005, Russom was an industry analyst covering data management at Forrester Research and Giga Information Group. He also ran his own business as an independent industry analyst and consultant, was a contributing editor with leading IT magazines, and was a product manager at database vendors. His Ph.D. is from Yale. You can reach him by email ([email protected]), on Twitter (twitter.com/prussom), and on LinkedIn (linkedin.com/in/philiprussom).


Advanced Analytics versus Online Analytic Processing (OLAP)

Blog by Philip Russom
Research Director for Data Management, TDWI

The current hype and hubbub around big data analytics has shifted our focus to what’s usually called “advanced analytics.” That’s an umbrella term for analytic techniques and tool types based on data mining, statistical analysis, or complex SQL – and sometimes natural language processing and artificial intelligence as well.

The term has been around since the late 1990s, so you’d think I’d get used to it. But I have to admit that the term “advanced analytics” rubs me the wrong way for two reasons:

First, it’s not a good description of what users are doing or what the technology does. Instead of “advanced analytics,” a better term would be “discovery analytics,” because that’s what users are doing. Or we could call it “exploratory analytics.” In other words, the user is typically a business analyst who is exploring data broadly to discover new business facts that no one in the enterprise knew before. These facts can then be turned into an analytic model or some equivalent for tracking over time.

Second, the thing that chafes me most is that the way the term “advanced analytics” has been applied for fifteen years excludes online analytic processing (OLAP). Huh!? Does that mean that OLAP is “primitive analytics”? Is OLAP somehow incapable of being advanced?

I personally don’t think so. In fact, depending on how you design and implement it, OLAP can be quite advanced. For example, OLAP is very much about dimensions. In the 90s, eight dimensions was considered an advanced implementation. Nowadays I regularly talk with people who have twenty or more. I realize there’s a difference between advanced and mature. But I have to say that I’ve seen lots of mature OLAP implementations that support hundreds of cubes, hundreds of OLAP reports, and thousands of users. Over the years, different approaches to OLAP (multidimensional, relational, desktop, etc.) have consolidated into a hybrid OLAP, such that most vendor products today are quite mature, feature rich, and flexible.

Here’s another, related issue. While researching a new TDWI report on big data analytics, I ran across a few people (users, consultants, and vendors) who think that “advanced analytics” (or whatever you want to call it) will render OLAP obsolete. Therefore, user organizations should expunge OLAP from their BI portfolios. Uh, no. I don’t see that happening.

In defense of OLAP, it’s by far the most common form of analytics in BI today, and for good reasons. Once you get used to multidimensional thinking, OLAP is very natural, because most business questions are themselves multidimensional. For example, “What are western region sales revenues in Q4 2010?” intersects dimensions for geography, function, money, and time. Discoveries made in OLAP are easily “institutionalized” or “operationalized” (much more so than advanced analytics), so OLAP analyses are repeated over time with consistency. Since dimensions are easily expressed as parameters, an OLAP-based report can be as easy to use as a parameterized report, thereby putting OLAP-based analytics within the comprehension of a vast range of possible end-users.
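
To make the multidimensional point concrete, here’s a minimal sketch in Python (my own illustration with made-up column names, not taken from any particular OLAP tool) of how that question becomes a slice of a cube-like dataset, and how dimensions double as report parameters:

```python
import pandas as pd

# Hypothetical fact table: one row per sale, with dimensional columns.
sales = pd.DataFrame({
    "region":  ["West", "West", "East", "West"],
    "quarter": ["2010-Q4", "2010-Q3", "2010-Q4", "2010-Q4"],
    "product": ["A", "B", "A", "B"],
    "revenue": [1200.0, 900.0, 1500.0, 700.0],
})

def revenue_report(df, region, quarter):
    """Dimensions act as parameters: slice along them, then aggregate."""
    sliced = df[(df["region"] == region) & (df["quarter"] == quarter)]
    return sliced.groupby("product")["revenue"].sum()

# "What are western region sales revenues in Q4 2010?"
print(revenue_report(sales, region="West", quarter="2010-Q4"))
```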

The scope of discovery of an analytic method seems to be an important concern right now, as seen in the current fascination with big data analytics. In that context, a possible limitation of OLAP is that most implementations are tightly coupled to datasets called cubes. If the information someone hopes to discover is not in a cube, that can be a problem. Even so, so-called relational OLAP can be a solution, and OLAP tools are so friendly nowadays that just about anyone can create a cube. Depending on how an OLAP implementation is designed and which vendor tools are used, a cube can limit the scope of discovery, just as any analytic dataset can – even if it’s multi-terabyte big data.

In my mind, advanced analytics is very much about open-ended exploration and discovery in large volumes of fairly raw source data. But OLAP is about a more controlled discovery of combinations of carefully prepared dimensional datasets. The way I see it: a cube is a closed system that enables combinatorial analytics. Given the richness of cubes users are designing nowadays, there’s a gargantuan number of combinations for a wide range of users to explore.

So, OLAP’s not going away. Users would be nuts to abandon their large investments in such a handy technology. And it’s like most situations in IT: few things go away. Organizations just keep adding more tool types and best practices to their portfolios. Therefore, user organizations should expect to maintain their useful investments in OLAP, while also digging deeper into other forms of exploratory and discovery analytics.

So, what do you think, folks? Let me know. Thanks!

Posted by Philip Russom, Ph.D. on August 5, 2011


Big Data Analytics: Avoid the Analytic Cul-De-Sac

Blog by Philip Russom
Research Director for Data Management, TDWI

Do you know what a cul-de-sac is? In French, it literally means “bottom of the bag.” But figuratively it means what most Americans would call a “dead-end street.” In residential real estate, a cul-de-sac is a desirable place to live. In analytics, a cul-de-sac is where the epiphanies of advanced analytics never get off a dead-end street to be fully leveraged elsewhere in the enterprise.

The current hype around big data analytics has most discussions of analytics focused on “discovery” analytics. That’s where a business analyst or similar user employs an advanced analytics tool (based on data mining, statistics, natural language processing, complex SQL, etc.) to discover facts never known before. For example, the analyst may discover the root cause for a new form of customer churn, a new partner behavior that’s potentially fraudulent, or the hidden costs that erode otherwise profitable customers.

While researching a new TDWI report on big data analytics, I’ve run across a number of business analysts who revel in the chase around the cul-de-sac but can’t be bothered with operationalizing their epiphanies. “That’s someone else’s job,” one guy told me. Here’s what I mean.

Too often, analysts drive around a figurative big data “bottom of the bag” until just the right dataset yields an epiphany. Then they share their findings with managers and move on to the next analytic project.

This is an analytic cul-de-sac: the analyst does not also take the findings off the dead-end street and “operationalize” them. In other words, once you discover the new form of churn, analytic models, metrics, reports, warehouse data, and so on need to be updated, so the appropriate managers can easily spot the churn and do something about it quickly if it returns. Likewise, hidden costs, once revealed, should be operationalized in analytics (and possibly reports and warehouses), so managers can better track and study costs over time, to keep them down.
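
To show what operationalizing can look like in the simplest terms, here’s a hedged sketch in Python. The churn rule and column names are entirely hypothetical; the point is that a one-off discovery becomes a repeatable check that a report or dashboard can refresh on a schedule:

```python
import pandas as pd

# Hypothetical rule discovered during analysis: customers on the legacy
# plan with two or more support calls this month tend to churn.
def flag_churn_risk(customers: pd.DataFrame) -> pd.DataFrame:
    at_risk = (customers["plan"] == "legacy") & (customers["support_calls"] >= 2)
    return customers.assign(churn_risk=at_risk)

def churn_risk_share(customers: pd.DataFrame) -> float:
    """A repeatable metric managers can track over time."""
    return flag_churn_risk(customers)["churn_risk"].mean()

customers = pd.DataFrame({
    "plan": ["legacy", "new", "legacy"],
    "support_calls": [3, 0, 1],
})
print(f"Share of customers at churn risk: {churn_risk_share(customers):.0%}")
```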

I think that most analysts and similar users are avoiding analytic cul-de-sacs by making sure that discovered epiphanies are operationalized by someone (whether the analyst or another team member). I’m just saying that the product of analytics isn’t necessarily being leveraged to the hilt in every organization.

To avoid analytic cul-de-sacs and similar squanderings of insight, you might want to review some of the processes around your use of advanced analytics. In particular, be sure the process extends beyond discovery into operationalizing the epiphanies of analytics.

So, what do you think, folks? Let me know. Thanks!

Posted by Philip Russom, Ph.D. on July 21, 2011


Big Data Analytics: Preparing Analytic Data Differs from ETL for Data Warehousing

Blog by Philip Russom
Research Director for Data Management, TDWI

While researching a new TDWI report on big data analytics, I’ve run across a few BI professionals who are concerned about the seeming lack of data preparation that’s common with some forms of advanced analytics. Allow me a moment to sort this out.

On the one hand, all of us in BI and data warehousing are indoctrinated to believe that the data of an enterprise data warehouse (EDW) (and hence the data that feeds into reports) must be absolutely pristine, integrated and aggregated properly, well documented, and modeled for optimization. To meet these requirements, BI teams work hard on extract, transform, and load (ETL), data quality (DQ), metadata and master data management (MDM), and data modeling. These data preparation best practices make perfect sense for the vast majority of the reports, dashboards, and OLAP-based analyses that are refreshed from data warehouse data. For those products of BI, we want to use only well-understood data that’s brought as close to perfection as possible. And many of these reports become public documents, where problems with the data could be dire for a business.

On the other hand, preparing data for advanced analytics requires very different best practices – especially when big data is involved. The product of advanced analytics is insight, typically an insight about bottom-line costs or customer churn or fraud or risk. These kinds of insights are never made public, and the analytic data they’re typically based on doesn’t have the reuse and publication requirements that data warehouse data has. Therefore, big data for advanced analytics rarely needs the full battery of ETL, data quality, metadata, and modeling we associate with data from an EDW.

In fact, if you bring to bear the full arsenal of data prep practices on analytic datasets, you run the risk of reducing their analytic value. This is ironic, because we usually think of ETL, DQ, and data modeling as adding value to data, not subtracting it. So, how can they harm analytic data?

To answer that question, let’s first take a look at so-called “advanced analytics.” This collection of analytic techniques would be better called “discovery analytics,” because that’s what users do with it. A business analyst or similar user applies techniques like data mining, statistical analysis, complex SQL, MapReduce, and natural language processing to discover facts about the business that no one knew before. For example, you might discover the root cause of the latest form of customer churn. Or you might find a cluster of transactions that indicate a new kind of fraud. Or you could stumble onto an untapped customer segment.

In general, you can’t discover those entities and facts from the overly studied, calculated, modeled, and aggregated data of an EDW. Instead, you need big data, with lots of granular detail, typically in the schema of the source systems it came from. Some forms of analytics actually thrive on questionable data in poor condition. For example, analytic applications for fraud detection may depend on outliers and non-standard data as indications of fraud. And the insights of discovery analytics often focus on narrow slices of the business, like an obscure customer segment, time frame, group of shipments, transaction type, or risky neighborhood. These thin slices can easily disappear in an aggregation pass. Hence, if you apply ETL and DQ processes to big data as you do for a data warehouse, you run the risk of stripping out the very nuggets that make big data a treasure trove for discovery-oriented advanced analytics. This is why the preparation of data for discovery analytics seems minimal (even slipshod) – often just extracts and table joins – compared to the full range of data prep applied to EDW data.
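
Here’s a small Python sketch (hypothetical tables and values, my own illustration) of what that minimal preparation looks like in practice: an extract and a join, with no cleansing or aggregation, so the outliers that might signal fraud survive into the analytic dataset:

```python
import pandas as pd

# Hypothetical extracts pulled straight from source systems, details intact.
transactions = pd.DataFrame({
    "account_id": [101, 102, 103, 103],
    "amount": [45.0, 52.0, 9999.0, -45.0],   # outliers deliberately left in
    "merchant": ["grocer", "grocer", "JEWELER??", "grocer"],
})
accounts = pd.DataFrame({
    "account_id": [101, 102, 103],
    "segment": ["retail", "retail", "retail"],
})

# "Minimal" prep: just an extract and a table join.
analytic_set = transactions.merge(accounts, on="account_id", how="left")

# The questionable values are exactly what discovery analytics wants to see.
suspects = analytic_set[(analytic_set["amount"] > 5000) | (analytic_set["amount"] < 0)]
print(suspects)
```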

Does this mean that we can throw out the sacrosanct best practices for ETL, DQ, metadata, MDM, and data modeling? No, of course not. Some organizations will simply need to suspend them for discovery analytics with big data—but only temporarily. Here’s a typical scenario.

After business analysts and other users have discovered what they’re looking for in big data, they need to take the discovery to the BI and DW team, so the results can be “institutionalized” in the EDW. For example, when discovery analytics reveals valuable items – like new forms of churn, customer segments, cost centers, etc. – these need to be represented by data structures in the EDW and reports, so that business people can track them regularly. At that point, the best practices of data preparation come back into play.

So, what do you think, folks? Let me know. Thanks!

Posted by Philip Russom, Ph.D. on July 12, 2011


Big Data Analytics: The View from SAP

Blog by Philip Russom
Research Director for Data Management, TDWI

A few weeks ago, I talked with Mike Eacrett, the vice president of product management for SAP HANA at SAP Labs. Among other things, Mike explained the “secret sauce” that gives SAP HANA flexibility and performance for big data analytics. Give me a moment to recount Mike’s explanation.

Philip Russom: What forms of analytics are you seeing on the rise with SAP customers?

Mike Eacrett: SAP customers continue to expand their investments in online analytic processing (OLAP). But the explosive growth is with exploratory analytics. That’s where a business user needs to learn things that he/she didn’t know to ask before. Or they need to see patterns or the absence of them in the data, typically in response to a change in the business or customer behavior. This kind of exploration requires big data, typically in its original source schema with all its details intact. Instead of transforming and cleansing the data prior to analysis (which can lose desirable data details), the user iteratively develops queries that manipulate data at the analytic tool level, not the physical storage level, as you would when, say, modeling a data warehouse.

Philip Russom: I’m familiar with this analytic method, so I know that it requires a hefty platform for big data analytics. What is SAP offering in this regard?

Mike Eacrett: We offer the SAP In-Memory Computing Appliance, otherwise known as SAP HANA. It’s an enterprise software architecture that enables analytic queries to run against detailed source data—and run fast in real time—without need for transforming the data into data models optimized for a specific type of analysis. To achieve this, SAP HANA implements its own massively parallel distributed processing method (similar to some of the concepts of MapReduce), based on HANA’s in-memory database, running code that utilizes the instruction set and vector processing capabilities of Intel chip sets. That means that the SAP user needn’t define analytic queries months in advance, then wait for IT to model data for them. All the data is available at their fingertips in memory. HANA gives logical data modeling a new twist, so that the analyst user can run queries as fast as he or she thinks them up, and without being limited by data models, data movement, and pre-aggregation constraints.

Philip Russom: You mentioned that SAP HANA gives logical data modeling a new twist. What do you mean?

Mike Eacrett: The term for this new technique is “logical data marting.” It assumes that all of the operational source data present in SAP modules and needed for analytics is also available in SAP HANA. A logical data model of a data mart is constructed in server memory, based on the analytic query that’s being executed. In SAP HANA-based applications, the same data model is used for online transaction processing (OLTP) and analytics – in other words, the data marts are a logical view of one persistence layer. The logical model draws data from the modules’ underlying memory-persisted tables, as needed by queries. As an analyst or HANA-based application iteratively redefines a query, the model automatically redraws itself, using analytic and calculation views. The logical model (based on queries against the pre-built SAP business content) liberates analysts from cumbersome data modeling, and the in-memory processing gives it true real-time speed.
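
To illustrate the general idea of a logical data mart – and this is a generic sketch using SQLite in Python, not SAP HANA syntax or its actual API – think of the mart as a view defined over one in-memory persistence layer, so no data is copied or pre-aggregated:

```python
import sqlite3

# One in-memory persistence layer serves both transactional rows and analytics.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, "West", 120.0), (2, "East", 80.0), (3, "West", 45.0)])

# The "logical mart" is just a view over the live transactional table.
con.execute("""
    CREATE VIEW revenue_by_region AS
    SELECT region, SUM(amount) AS revenue
    FROM orders
    GROUP BY region
""")

# Analysts query the view; redefine the view and the "mart" redraws itself
# without moving or duplicating any data.
print(con.execute("SELECT * FROM revenue_by_region").fetchall())
```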

So, what do you think, folks? Let me know. Thanks!

Posted by Philip Russom, Ph.D. on June 27, 2011


Big Data Analytics: Frequently Asked Questions (FAQ)

Blog by Philip Russom
Research Director for Data Management, TDWI

What exactly is Big Data Analytics?

It’s two things: big data and the kind of analytics users want to do with big data. Let’s start with big data, then come back to analytics.

Users interviewed by TDWI state that data isn’t big until it breaks 10 terabytes. So that’s the low end of big data. And some user organizations have cached away hundreds of terabytes – just for analytics. The size of big data is relative; hundreds of terabytes isn’t new, but hundreds of terabytes just for analytics is – at least, for most user organizations.

Big Data is all about multi-terabyte datasets, right?

No, there’s more to it than that. Size aside, there are other ways to define big data. In particular, big data tends to be diverse, and it’s the diversity that drives up the data volume. For example, analytic methods that are on the rise need to correlate data points drawn from many sources, both inside the enterprise and outside it. Furthermore, one of the new things about analytics is that it’s NOT just based on structured data, but also on unstructured data (like human language text), semi-structured data (like XML files and RSS feeds), and data derived from audio and video. Again, the diversity of data types drives up data volume.

Finally, big data can be defined by its velocity or speed. This may also be defined by the frequency of data generation. For example, think of the stream of data coming off of any kind of sensor, say thermometers sensing temperature, microphones listening for movement in a secure area, or video cameras scanning for a specific face in a crowd. With sensor data flying at you relentlessly in real time, data volumes get big in a hurry. Even more challenging, the analytics that go with streaming data have to make sense of the data and possibly take action—all in real time.

Hence, big data is more than large datasets. It’s also about diverse data sources or data types (and these may be arriving at various speeds), plus the challenges of analyzing data in these demanding circumstances.

What kinds of analytics go with big data?

The kind of analytics applied to big data is often called “advanced analytics.” A better term would be “discovery analytics,” because that’s what users are trying to accomplish. In other words, with big data analytics, the user is typically a business analyst who is trying to discover new business facts that no one in the enterprise knew before. To do that, you need large volumes of data with a lot of detail. And this is usually data that the enterprise has not yet tapped for analytics. For example, in the middle of the recent economic recession, companies were constantly being hit by new forms of customer churn. To discover the root cause of the newest form of churn, a business analyst grabs several terabytes of detailed data drawn from operational applications to get a view of recent customer behaviors. He may mix that data with historic data from a data warehouse. Dozens of queries later, he’s discovered a new churn behavior in a subset of the customer base. With any luck, he’ll turn that information into an analytic model, with which the company can track and predict the new form of churn.
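
As a rough illustration of that last step (turning the discovery into an analytic model), here’s a hedged Python sketch with made-up features, assuming scikit-learn is available:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features from the detailed operational extract:
# [support_calls_last_90_days, months_on_legacy_plan]
X = np.array([[3, 12], [0, 0], [4, 18], [1, 2], [5, 24], [0, 1]])
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = churned in the following quarter

# The discovery, captured as a model the company can score customers with.
model = LogisticRegression().fit(X, y)
new_customers = np.array([[2, 10], [0, 3]])
print(model.predict_proba(new_customers)[:, 1])  # predicted churn risk
```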

What kind of analytic tool does a business analyst need for the “discovery analytics” that’s common with big data?

Discovery analytics against big data can be enabled by different types of analytic tools, including those based on SQL queries, data mining, statistical analysis, fact clustering, data visualization, natural language processing, text analytics, artificial intelligence, and so on. It’s quite an arsenal of tool types, and savvy users get to know their analytic requirements before deciding which tool type is appropriate to their needs.

Is big data a problem just to be managed (with its size, diversity, and speed) or is it an opportunity to be seized?

TDWI is currently running an Internet-based survey about big data analytics. An early extraction of survey data shows that only 30% of users responding to the survey are concerned about the technical challenges of collecting and managing big data. The vast majority – namely 70% of the users responding to the survey – say that big data is definitely an opportunity. That’s because, through analysis, the user organization can discover new facts about its customers, markets, partners, costs, and operations, then use that information for business advantage.

So, what do you think, folks? Let me know. Thanks!

========================================
Don’t miss TDWI’s Big Data Analytics Survey. Please share your opinions and experiences by taking the online survey.

Posted by Philip Russom, Ph.D. on June 21, 2011


The Three Vs of Big Data Analytics: VELOCITY

Blog by Philip Russom
Research Director for Data Management, TDWI

In prior blogs, I’ve talked about how big data’s primary attribute is data volume. That’s pretty obvious. But it’s defined by other characteristics, too. For example, one of the things that makes big data so big is that it’s coming from a greater variety of sources than ever before. Now let’s look at the last of the three Vs of Big Data Analytics, namely data velocity.

Data Feed Velocity as a defining attribute of Big Data

Big data can be described by its velocity or speed. Or you may prefer to think of it as the frequency of data generation or frequency of data delivery. For example, think of the stream of data coming off of any kind of sensor, say thermometers sensing temperature, microphones listening for movement in a secure area, or video cameras scanning for a specific face in a crowd. This isn’t new; many firms have been collecting click stream data off of Web sites for years, using streaming data to make purchase recommendations to Web visitors. With sensor and Web data flying at you relentlessly in real time, data volumes get big in a hurry. Even more challenging, the analytics that go with streaming data have to make sense of the data and possibly take action—all in real time.
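
Here’s a tiny Python sketch (a simulated feed, not a real sensor API) of why streaming data changes the analytic pattern: the logic has to evaluate each reading as it arrives and act immediately, rather than analyzing a finished dataset after the fact:

```python
import random
import time

def sensor_stream(n=20):
    """Stand-in for a real feed: yields one temperature reading per tick."""
    for _ in range(n):
        yield random.gauss(21.0, 3.0)
        time.sleep(0.01)  # simulate the arrival rate

THRESHOLD = 27.0

# Decide on every reading the moment it arrives; there is no batch window.
for reading in sensor_stream():
    if reading > THRESHOLD:
        print(f"ALERT: temperature {reading:.1f} exceeds {THRESHOLD}")
```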

So that you don’t think this is all science fiction, allow me to share some of the use cases for high-velocity data feeds and streams that I’ve heard recently.

Here’s an unsubstantiated anecdote that someone told me: “There’s a cell service provider in Japan that collects GPS data from cell phone users. The cell provider collects the data in real time, and keeps track of which people are walking the furthest. Once a month, the cell provider gives an award to the walker who covered the greatest distance. In a way, cell phones are working like sensors to collect and analyze streaming big data.”

I also heard a similar anecdote: “Imagine that I’m a consumer walking around downtown in a city, and I’m shopping. Now imagine letting a shopping service know where I am, plus maybe the kinds of goods I’m looking for. As I walk, the GPS coordinates could stream to the shopping service, and it could point me to stores that match my interests.”

A consultant who specializes in streaming data told me about some video and audio analytic applications he’s looking into: “Think about the algorithms that enable us to parse text and perform sentiment analysis, sometimes in real time. Very similar algorithms can parse video images to document and analyze changes in the thing that’s being imaged. Satellite images could monitor and analyze troop movements, a floodplain, cloud patterns, and grass fires. Or a video analysis system could monitor a sensitive or valuable facility, watching for possible intruders, then alert authorities in real time. You can implement similar applications with sound monitoring; one of my apps involves two thousand underground microphones to listen for movement in geologic formations. I hope it can eventually help predict earthquakes.”

Here’s a related user story about streaming big data that I heard recently: “You don’t need all of the streaming data. You just need the interesting pieces or just the one piece that identifies what you’re looking for. We’ve all seen video footage from the US military’s unmanned jet drones. A drone is processing several frames of video per second looking for shapes or light signatures that match its programming. For example, it might be looking for shapes that look like tanks or sun reflections that could come from metallic weapons. The drone deletes almost all of the frames, because they’re not of interest. And that helps avoid data glut that could choke the system.”

A prominent Internet-based business told me a few weeks ago: “We load 200 gigabytes a day into our data warehouse. But that’s processed down from several terabytes of Web log and click-stream data. We mix this big data with data about our customers drawn from other touch points, then analyze it. Although the data is streaming, we collect the stream on disk, then process it down and analyze it overnight. Our next step is to process and analyze streaming big data in real time. We’re definitely a customer-oriented business, so understanding customers and serving them better is the goal of analytics. We just need to do it both after the fact in batch and – eventually – in real time.”

So, what do you think, folks? Let me know. Thanks!

========================================
This blog is number 3 in a series of 3, all about the three Vs of big data analytics, namely data volume, variety, and velocity. You can read the first blog here. And you can read the second blog here.

Don’t miss TDWI’s Big Data Analytics Survey. Please share your opinions and experiences by taking the online survey.

Posted by Philip Russom, Ph.D. on June 17, 2011