Blog by Philip Russom
Research Director for Data Management, TDWI
A few weeks ago, I talked with Mike Eacrett, the vice president of product management for SAP HANA at SAP Labs. Among other things, Mike explained the “secret sauce” that gives SAP HANA flexibility and performance for big data analytics. Give me a moment to recount Mike’s explanation.
Philip Russom: What forms of analytics are you seeing on the rise with SAP customers?
Mike Eacrett: SAP customers continue to expand their investments in online analytic processing (OLAP). But the explosive growth is in exploratory analytics. That’s where business users need to learn things they didn’t know to ask about before, or need to see patterns (or the absence of them) in the data, typically in response to a change in the business or in customer behavior. This kind of exploration requires big data, typically in its original source schema with all its details intact. Instead of transforming and cleansing the data prior to analysis (which can lose desirable detail), the user iteratively develops queries that manipulate data at the analytic tool level, not the physical storage level, as you would when, say, modeling a data warehouse.
Philip Russom: I’m familiar with this analytic method, so I know that it requires a hefty platform for big data analytics. What is SAP offering in this regard?
Mike Eacrett: We offer the SAP In-Memory Computing Appliance, otherwise known as SAP HANA. It’s an enterprise software architecture that enables analytic queries to run against detailed source data—and run fast, in real time—without the need to transform the data into data models optimized for a specific type of analysis. To achieve this, SAP HANA implements its own massively parallel distributed processing method (similar to some of the concepts of MapReduce), based on HANA’s in-memory database, running code that exploits the instruction set and vector processing capabilities of Intel chipsets. That means SAP users needn’t define analytic queries months in advance and then wait for IT to model data for them; all the data is available at their fingertips in memory. HANA gives logical data modeling a new twist, so that analysts can run queries as fast as they can think them up, without being limited by data models, data movement, or pre-aggregation constraints.
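To make that idea a bit more concrete, here’s a minimal sketch in Python—not SAP HANA code and not SAP’s APIs, just a toy columnar, vectorized aggregation—showing why keeping detailed data in memory as columns lets ad hoc questions be answered without pre-built aggregates or cubes:

```python
# Illustrative only: a toy columnar, vectorized aggregation in NumPy.
# This is NOT SAP HANA code; it merely sketches why detailed data held
# in memory as columns can answer ad hoc queries without pre-aggregation.
import numpy as np

n = 1_000_000                                   # detailed fact rows kept in memory
region  = np.random.randint(0, 50, size=n)      # a "column" of region keys
revenue = np.random.rand(n) * 1000.0            # a "column" of revenue values

# An ad hoc query ("total revenue for region 7") is just a vectorized
# scan over the columns -- no cube, no data movement, no remodeling.
mask = (region == 7)
print("Region 7 revenue:", revenue[mask].sum())

# A different question a minute later needs no new data model,
# only a different predicate or grouping:
totals = np.bincount(region, weights=revenue)   # group-by-region in one pass
print("Top region:", totals.argmax())
```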
Philip Russom: You mentioned that SAP HANA gives logical data modeling a new twist. What do you mean?
Mike Eacrett: The term for this new technique is “logical data marting.” It assumes that all the operational source data needed for analytics and present in SAP modules is also available in SAP HANA. A logical data model of a data mart is constructed in server memory, based on the analytic query being executed. In SAP HANA-based applications, the same data model is used for online transaction processing (OLTP) and analytics—in other words, the data marts are logical views over one persistence layer. The logical model draws data from the modules’ underlying in-memory tables, as needed by queries. As an analyst or HANA-based application iteratively redefines a query, the model automatically redraws itself, using analytic and calculation views. The logical model (based on queries against the pre-built SAP business content) liberates analysts from cumbersome data modeling, and the in-memory processing gives it true real-time speed.
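Here’s a minimal sketch of the “logical data mart” idea in Python—hypothetical structures, not SAP’s business content or calculation views—where analytic views are derived on demand from the same in-memory tables that transactions write to, rather than copied into a physical mart:

```python
# A minimal sketch of a logical data mart: analytic views are computed
# per query over ONE in-memory persistence layer shared with OLTP.
# (Hypothetical table and field names -- not SAP structures.)
orders = [  # the single persistence layer
    {"order_id": 1, "customer": "A", "amount": 120.0, "region": "EU"},
    {"order_id": 2, "customer": "B", "amount": 75.5,  "region": "US"},
    {"order_id": 3, "customer": "A", "amount": 310.0, "region": "EU"},
]

def analytic_view(rows, group_by, measure):
    """Build a 'logical mart' on the fly for whatever the query asks."""
    view = {}
    for r in rows:
        view[r[group_by]] = view.get(r[group_by], 0.0) + r[measure]
    return view

# The analyst iterates: each redefined query simply re-derives the view.
print(analytic_view(orders, "region", "amount"))    # {'EU': 430.0, 'US': 75.5}
print(analytic_view(orders, "customer", "amount"))  # {'A': 430.0, 'B': 75.5}
```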
So, what do you think, folks? Let me know. Thanks!
Posted by Philip Russom, Ph.D. on June 27, 2011
Blog by Philip Russom
Research Director for Data Management, TDWI
What exactly is Big Data Analytics?
It’s two things: big data and the kind of analytics users want to do with big data. Let’s start with big data, then come back to analytics.
Users interviewed by TDWI state that data isn’t big until it breaks 10 TB. So that’s the low end of big data. And some user organizations have cached away hundreds of terabytes--just for analytics. The size of big data is relative; hundreds of terabytes isn’t new, but hundreds of terabytes just for analytics is—at least, for most user organizations.
Big Data is all about multi-terabyte datasets, right?
No, there’s more to it than that. Size aside, there are other ways to define big data. In particular, big data tends to be diverse, and it’s the diversity that drives up the data volume. For example, analytic methods that are on the rise need to correlate data points drawn from many sources, both in the enterprise and outside it. Furthermore, one of the new things about analytics is that it’s NOT just based on structured data, but on unstructured data (like human language text) and semi-structured data (like XML files, RSS feeds), and data derived from audio and video. Again, the diversity of data types drives up data volume.
Finally, big data can be defined by its velocity or speed. This may also be defined by the frequency of data generation. For example, think of the stream of data coming off of any kind of sensor, say thermometers sensing temperature, microphones listening for movement in a secure area, or video cameras scanning for a specific face in a crowd. With sensor data flying at you relentlessly in real time, data volumes get big in a hurry. Even more challenging, the analytics that go with streaming data have to make sense of the data and possibly take action—all in real time.
Hence, big data is more than large datasets. It’s also about diverse data sources or data types (and these may be arriving at various speeds), plus the challenges of analyzing data in these demanding circumstances.
What kinds of analytics go with big data?
The kind of analytics applied to big data is often called “advanced analytics.” A better term would be “discovery analytics” because that’s what users are trying to accomplish. In other words, with big data analytics, the user is typically a business analyst who is trying to discover new business facts that no one in the enterprise knew before. To do that, you need large volumes of data with a lot of detail. And this is usually data that the enterprise has not yet tapped for analytics. For example, in the middle of the recent economic recession, companies were constantly being hit by new forms of customer churn. To discover the root cause of the newest form of churn, a business analyst grabs several terabytes of detailed data drawn from operational applications to get a view of recent customer behaviors. He may mix that data with historic data from a data warehouse. Dozens of queries later, he’s discovered a new churn behavior in a subset of the customer base. With any luck, he’ll turn that information into an analytic model, with which the company can track and predict the new form of churn.
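As a toy sketch of that discovery-to-model path—with hypothetical fields and thresholds, and far less data and rigor than real churn work would use—the flow looks something like this in Python:

```python
# A toy sketch of discovery analytics for churn (hypothetical fields and
# thresholds; a real effort would use proper modeling tools and much more data).
customers = [
    # recent behavior from operational systems mixed with warehouse history
    {"id": 1, "support_calls_90d": 6, "logins_90d": 2,  "churned": True},
    {"id": 2, "support_calls_90d": 0, "logins_90d": 40, "churned": False},
    {"id": 3, "support_calls_90d": 5, "logins_90d": 3,  "churned": True},
    {"id": 4, "support_calls_90d": 1, "logins_90d": 25, "churned": False},
]

# "Dozens of queries later," the analyst spots a pattern in a subset...
suspects = [c for c in customers
            if c["support_calls_90d"] >= 5 and c["logins_90d"] < 5]
print("Matches discovered pattern:", [c["id"] for c in suspects])

# ...which can then be frozen into a scoring rule (a stand-in for a real
# analytic model) and applied to new customers going forward.
def churn_risk(c):
    return 0.9 if c["support_calls_90d"] >= 5 and c["logins_90d"] < 5 else 0.1

print({c["id"]: churn_risk(c) for c in customers})
```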
What kind of analytic tool does a business analyst need for the “discovery analytics” that’s common with big data?
Discovery analytics against big data can be enabled by different types of analytic tools, including those based on SQL queries, data mining, statistical analysis, fact clustering, data visualization, natural language processing, text analytics, artificial intelligence, and so on. It’s quite an arsenal of tool types, and savvy users get to know their analytic requirements before deciding which tool type is appropriate to their needs.
Is big data a problem just to be managed (with its size, diversity, and speed) or is it an opportunity to be seized?
TDWI is currently running an Internet-based survey about big data analytics. An early extraction of survey data shows that only 30% of users responding to the survey are concerned about the technical challenges of collecting and managing big data. The vast majority – namely 70% of the users responding to the survey – say that big data is definitely an opportunity. That’s because through analysis the user organization can discover new facts about their customers, markets, partners, costs, and operations, then use that information for business advantage.
So, what do you think, folks? Let me know. Thanks!
========================================
Don’t miss TDWI’s Big Data Analytics Survey. Please share your opinions and experiences by taking the online survey.
Posted by Philip Russom, Ph.D. on June 21, 2011
Blog by Philip Russom
Research Director for Data Management, TDWI
In prior blogs, I’ve talked about how big data’s primary attribute is data volume. That’s pretty obvious. But it’s defined by other characteristics, too. For example, one of the things that makes big data so big is that it’s coming from a greater variety of sources than ever before. Now let’s look at the last of the three Vs of Big Data Analytics, namely data velocity.
Data Feed Velocity as a defining attribute of Big Data
Big data can be described by its velocity or speed. Or you may prefer to think of it as the frequency of data generation or frequency of data delivery. For example, think of the stream of data coming off of any kind of sensor, say thermometers sensing temperature, microphones listening for movement in a secure area, or video cameras scanning for a specific face in a crowd. This isn’t new; many firms have been collecting click stream data off of Web sites for years, using streaming data to make purchase recommendations to Web visitors. With sensor and Web data flying at you relentlessly in real time, data volumes get big in a hurry. Even more challenging, the analytics that go with streaming data have to make sense of the data and possibly take action—all in real time.
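Here’s a minimal sketch of that streaming pattern in Python—the sensor feed is simulated, and a real system would read from a message bus or device interface—showing how readings are processed and acted on as they arrive rather than after the fact:

```python
# A minimal sketch of real-time stream processing: read sensor events as
# they arrive, keep a rolling window, and act per event. The "sensor" here
# is simulated; this is illustrative, not a production streaming engine.
import random
from collections import deque

def temperature_stream(n=100):
    """Simulated thermometer feed with occasional spikes."""
    for _ in range(n):
        yield 20.0 + random.gauss(0, 1) + (15 if random.random() < 0.02 else 0)

window = deque(maxlen=10)          # last 10 readings
for reading in temperature_stream():
    window.append(reading)
    rolling_avg = sum(window) / len(window)
    if reading > rolling_avg + 5:  # crude per-event anomaly check
        print(f"ALERT: spike {reading:.1f} vs rolling avg {rolling_avg:.1f}")
```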
So that you don’t think this is all science fiction, allow me to share some of the use cases for high-velocity data feeds and streams that I’ve heard recently.
Here’s an unsubstantiated anecdote that someone told me: “There’s a cell service provider in Japan that collects GPS data from cell phone users. The cell provider collects the data in real time, and keeps track of which people are walking the furthest. Once a month, the cell provider gives an award to the walker who covered the greatest distance. In a way, cell phones are working like sensors to collect and analyze streaming big data.”
I also heard a similar anecdote: “Imagine that I’m a consumer walking around downtown in a city, and I’m shopping. Now imagine letting a shopping service know where I am, plus maybe the kinds of goods I’m looking for. As I walk, the GPS coordinates could stream to the shopping service, and it could point me to stores that match my interests.”
A consultant who specializes in streaming data told me about some video and audio analytic applications he’s looking into: “Think about the algorithms that enable us to parse text and perform sentiment analysis, sometimes in real time. Very similar algorithms can parse video images to document and analyze changes in the thing that’s being imaged. Satellite images could monitor and analyze troop movements, a flood plain, cloud patterns, and grass fires. Or a video analysis system could monitor a sensitive or valuable facility, watching for possible intruders, then alert authorities in real time. You can implement similar applications with sound monitoring; one of my apps involves two thousand underground microphones to listen for movement in geologic formations. I hope it can eventually help predict earthquakes.”
Here’s a related user story about streaming big data that I heard recently: “You don’t need all of the streaming data. You just need the interesting pieces or just the one piece that identifies what you’re looking for. We’ve all seen video footage from the US military’s unmanned jet drones. A drone is processing several frames of video per second looking for shapes or light signatures that match its programming. For example, it might be looking for shapes that look like tanks or sun reflections that could come from metallic weapons. The drone deletes almost all of the frames, because they’re not of interest. And that helps avoid data glut that could choke the system.”
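The “keep only the interesting pieces” pattern from that story looks roughly like the sketch below—the frames and detector are simulated, and the threshold is made up—where each incoming item is scored, rare matches are kept, and everything else is discarded on the spot so the stream never piles up:

```python
# A sketch of stream filtering: score each incoming item, keep the rare
# matches, drop the rest immediately to avoid data glut. Simulated data.
import random

def frame_stream(n=10_000):
    """Simulated stream of video frames with a detector score attached."""
    for i in range(n):
        yield {"frame": i, "match_score": random.random()}

THRESHOLD = 0.999                      # only ~0.1% of frames are "interesting"
kept = [f for f in frame_stream() if f["match_score"] >= THRESHOLD]
print(f"Kept {len(kept)} of 10000 frames; the rest were dropped on the spot.")
```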
A prominent Internet-based business told me a few weeks ago: “We load 200 gigabytes a day into our data warehouse. But that’s processed down from several terabytes of Web log and click-stream data. We mix this big data with data about our customers drawn from other touch points, then analyze it. Although the data is streaming, we collect the stream on disk, then process it down and analyze it overnight. Our next step is to process and analyze streaming big data in real time. We’re definitely a customer-oriented business, so understanding customers and serving them better is the goal of analytics. We just need to do it both after the fact in batch and – eventually – in real time.”
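The “process it down” step that business describes can be sketched like this—field names are hypothetical, and real pipelines run at terabyte scale—where raw click events are reduced to compact per-customer summaries before anything is loaded into the warehouse:

```python
# A sketch of batch reduction: raw click events (huge) are aggregated into
# small per-customer summaries before the warehouse load. Hypothetical fields.
from collections import defaultdict

raw_clicks = [  # stand-in for terabytes of Web log / click-stream records
    {"customer": "A", "page": "/home"},
    {"customer": "A", "page": "/product/42"},
    {"customer": "B", "page": "/home"},
    {"customer": "A", "page": "/checkout"},
]

summary = defaultdict(lambda: {"clicks": 0, "pages": set()})
for event in raw_clicks:
    summary[event["customer"]]["clicks"] += 1
    summary[event["customer"]]["pages"].add(event["page"])

# Only this much smaller summary would be loaded into the warehouse.
for customer, s in summary.items():
    print(customer, s["clicks"], "clicks across", len(s["pages"]), "pages")
```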
So, what do you think, folks? Let me know. Thanks!
========================================
This blog is number 3 in a series of 3, all about the three Vs of big data analytics, namely data volume, variety, and velocity. You can read the first blog here. And you can read the second blog here.
Don’t miss TDWI’s Big Data Analytics Survey. Please share your opinions and experiences by taking the online survey.
Posted by Philip Russom, Ph.D. on June 17, 2011
Blog by Philip Russom
Research Director for Data Management, TDWI
This blog is number 2 in a series of 3, about the three Vs of big data analytics, namely data volume, variety, and velocity. You can read the first blog here online.
Data Type Variety as a defining attribute of Big Data
One of the things that makes big data big is that it’s coming from a greater variety of sources than ever before. Many of the newer ones are Web sources (logs, click streams, and social media). Sure, user organizations have been collecting Web data for years. But, for most organizations, it’s been a kind of hoarding. We’ve seen similar untapped big data collected and hoarded, such as RFID data from supply chain apps, text data from call center apps, semi-structured data from various insurance processes, and geospatial data in logistics. What’s changed is that far more users are now analyzing big data, instead of merely hoarding it. And the few organizations that have been analyzing it now do so at a more complex and sophisticated level. A related point is that big data isn’t new; but the effective leverage of it for analytics is. (For more on that point, see my blog: The Intersection of Big Data and Advanced Analytics.)
But my real point for this blog is that the recent tapping of these sources means that so-called structured data (which previously held unchallenged hegemony in analytics) is now joined (both figuratively and literally) by unstructured data (text and human language) and semi-structured data (XML, RSS feeds). There’s also data that’s hard to categorize, as it comes from audio, video, and other devices. Plus, multidimensional data can be drawn from a data warehouse to add historic context to big data. I hope you realize that’s a far more eclectic mix of data types than analytics has ever seen (or any discipline within BI, for that matter). So, with big data, variety is just as big as volume. Plus, variety and volume tend to fuel each other.
To further support the point that big data is about variety, let’s look at Hadoop. I managed to find a couple of users who’ve used Hadoop as an analytic database. Both said the same thing: Hadoop’s scalability for big data volumes is impressive. But the real reason they’re working with Hadoop is its ability to manage a very broad range of data types in its file system, plus process analytic queries via MapReduce across numerous eccentric data types.
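To show why that map/reduce pattern copes so well with eccentric data types, here’s a toy sketch in plain Python—not Hadoop’s APIs, just the shape of the idea—where the same map and reduce functions run over text, XML-ish fragments, and numeric logs alike:

```python
# A toy MapReduce-style sketch in plain Python (not Hadoop APIs), showing
# the same map/reduce pattern applied across very different record types.
from collections import defaultdict

records = [
    ("text", "customer unhappy with delivery delay"),
    ("xml",  "<feedback><topic>delivery</topic></feedback>"),
    ("log",  "2011-06-14 delivery_delay_minutes=95"),
]

def map_fn(kind, payload):
    # emit (key, 1) whenever a record mentions delivery, whatever its format
    if "delivery" in payload:
        yield ("delivery_mentions", 1)

def reduce_fn(pairs):
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

mapped = [pair for kind, payload in records for pair in map_fn(kind, payload)]
print(reduce_fn(mapped))   # {'delivery_mentions': 3}
```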
Stay tuned for the third and final blog in this series, which will be titled: The Three Vs of Big Data Analytics: VELOCITY.
=============================================
NOTE -- Don’t miss TDWI’s Big Data Analytics Survey. Please share your opinions and experiences by taking this online survey.
Posted by Philip Russom, Ph.D. on June 14, 2011
Blog by Philip Russom
Research Director for Data Management, TDWI
I was recently on a group call along with several other analysts where IBMers spelled out their definition of big data. They structured the definition by explaining big data’s primary attributes, namely data volume, data type variety, and the velocity of streams and other real-time data. I don’t necessarily agree with everything the IBMers said, but I must say that the three Vs of big data – volume, variety, and velocity – constitute a more comprehensive definition than I’ve heard elsewhere. In particular, the three Vs bust the myth that big data is only about data volume. Plus, the term “three Vs” is a catchy mnemonic. So I freely admit that I am shamelessly stealing the concept of the three Vs as a structure for my own definition of big data.
Note that IBMers didn’t consistently link big data with advanced analytics – but I will. This blog focuses on data volume, whereas other upcoming blogs will hit data type variety and data stream velocity.
Data Volume as a defining attribute of Big Data
It’s pretty obvious that data volume is the primary attribute of big data. With that in mind, some people have asked me for a definitive number quantifying the volume, a common question being: “Exactly how many terabytes constitute big data?” In some user interviews I’ve conducted lately, users have said that big data used to start at 3 terabytes, but now the bottom threshold is more like 10 terabytes. In a 2010 TDWI Technology Survey, a third of users surveyed said they will have 10 terabytes within three years. So 3 to 10 terabytes seems an accurate baseline – for now.
But there’s a catch. Note that my research isn’t about just any big data; it’s about big data collected specifically for analytics. So the numbers quoted above are only for analytic datasets -- not all BI data stores and certainly not every bit and byte in an enterprise.
Here are some comments from the field that add more attributes to big data quantification. I asked one user how many terabytes he’s managing for analytics, and he said: “I don’t know, because I don’t have to worry about storage. IT provides it generously, and I tap it like crazy.” Another user said: “We don’t count terabytes. We count records. My analytic database for quality assurance alone has 3 billion records. There’s another 3 billion in other analytic databases.”
From this we see that big data is a moving target that’s growing, there are different units for quantifying it, and it varies with scope (e.g., analytics vs BI vs whole enterprise). In future blogs in this series, we’ll see that data variety and velocity are just as important as volume when it comes to defining big data. Please stay tuned for those blogs.
So, what do you think, folks? Let me know. Thanks!
======================================================
Don’t miss TDWI’s Big Data Analytics Survey. Please share your opinions and experiences by taking the survey online: http://bit.ly/jxWh9N
Posted by Philip Russom, Ph.D. on June 9, 2011
Blog by Philip Russom
Research Director for Data Management, TDWI
I recently chatted with Paul Groom, the VP of Business Intelligence at Kognitio. Among other things, Paul had some great tips for moving beyond common barriers to analytics with big data. I’d like to share some of those tips with you.
Philip Russom: I’ve encountered several user companies that are hoarding big data – especially log data from Web sites – but they don’t know how to get started with analyzing it. Are you seeing this, too?
Paul Groom: Yes. I call it “data car parking.” Over time, the data car park gets so big that it’s a psychological barrier to taking any kind of action. For some reason, many data warehouse professionals think they have to process the entire data car park all at once – with the usual ETL, data quality, and data modeling techniques – before analytics can commence. That particular mindset is a show-stopper for big data analytics.
Philip Russom: In data warehousing, we’re taught that transforming, cleansing, and modeling data are requirements, because reports require squeaky clean, auditable data. But analytics and big data have different requirements. Right?
Paul Groom: Right. OLAP aside, most analytic methods require large samples of highly detailed data drawn straight from operational sources. That’s because a business analyst is trying to discover unknown business facts in previously untapped data, which differs from data warehousing, where reports present known business facts based on well-understood data. Careful data preparation is desirable in data warehousing for reports, but it’s actually a problem for analytics, because data prep strips out the details and granularity that analytics depends on. Oddly enough, when users figure out that they should forgo most of the data prep they’re used to in data warehousing, it removes a barrier so they can proceed to analytics with the big data they’ve been caching away.
Philip Russom: I’ve been talking up the perils of data prep for analytics for about two years now. Even when users get the point, they’re still skeptical about the next step, namely complex analytic queries against non-optimized big data.
Paul Groom: We get a lot of that, too. The skepticism is natural, because data warehouse pros have been using hand-me-down database management systems designed for transaction processing, and these don’t perform well with complex analytic queries. But the newest generation of analytic databases does. Assuming you have one of these, such as Kognitio WX2, then the rules of the analytic game just changed.
Our mantra is: “Trust the database.” A modern analytic database can quickly execute any query you come up with, without need for time-consuming data prep or repetitive tweaking of queries and data models. Once users build confidence in new database performance, it removes another barrier to analytics.
So, what do you think, folks? Let me know. Thanks!
Posted by Philip Russom, Ph.D. on May 31, 2011