Blog by Philip Russom
Research Director for Data Management, TDWI
What exactly is Big Data Analytics?
It’s two things: big data and the kind of analytics users want to do with big data. Let’s start with big data, then come back to analytics.
Users interviewed by TDWI say that data isn’t big until it breaks 10 TB. So that’s the low end of big data. And some user organizations have cached away hundreds of terabytes--just for analytics. The size of big data is relative; hundreds of terabytes isn’t new, but hundreds of terabytes just for analytics is--at least, for most user organizations.
Big Data is all about multi-terabyte datasets, right?
No, there’s more to it than that. Size aside, there are other ways to define big data. In particular, big data tends to be diverse, and it’s the diversity that drives up the data volume. For example, analytic methods that are on the rise need to correlate data points drawn from many sources, both in the enterprise and outside it. Furthermore, one of the new things about analytics is that it’s NOT just based on structured data, but on unstructured data (like human language text) and semi-structured data (like XML files, RSS feeds), and data derived from audio and video. Again, the diversity of data types drives up data volume.
Finally, big data can be defined by its velocity or speed. This may also be defined by the frequency of data generation. For example, think of the stream of data coming off of any kind of sensor, say thermometers sensing temperature, microphones listening for movement in a secure area, or video cameras scanning for a specific face in a crowd. With sensor data flying at you relentlessly in real time, data volumes get big in a hurry. Even more challenging, the analytics that go with streaming data have to make sense of the data and possibly take action—all in real time.
Hence, big data is more than large datasets. It’s also about diverse data sources or data types (and these may be arriving at various speeds), plus the challenges of analyzing data in these demanding circumstances.
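To make the velocity point concrete, here is a minimal Python sketch of acting on a simulated sensor feed as readings arrive, rather than storing the stream and analyzing it later. It is purely illustrative: the sensor names, the simulated readings, and the alert threshold are all made up.

```python
import random
import time

def sensor_stream():
    """Simulate an endless feed of (sensor_id, temperature) readings."""
    while True:
        yield random.choice(["t-101", "t-102", "t-103"]), random.gauss(21.0, 4.0)
        time.sleep(0.01)   # pace the simulated feed

ALERT_THRESHOLD = 30.0     # degrees C; a made-up cutoff for this sketch

def monitor(stream, limit=200):
    """Act on each reading as it arrives instead of persisting the whole stream."""
    for i, (sensor_id, temp) in enumerate(stream):
        if temp > ALERT_THRESHOLD:
            print(f"ALERT: {sensor_id} reported {temp:.1f} C")
        if i >= limit:     # bound the demo; a real monitor runs indefinitely
            break

monitor(sensor_stream())
```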
What kinds of analytics go with big data?
The kind of analytics applied to big data is often called “advanced analytics.” A better term would be “discovery analytics” because that’s what users are trying to accomplish. In other words, with big data analytics, the user is typically a business analyst who is trying to discover new business facts that no one in the enterprise knew before. To do that, you need large volumes of data that has a lot of details. And this is usually data that the enterprise has not tapped for analytics. For example, in the middle of the recent economic recession, companies were constantly being hit by new forms of customer churn. To discover the root cause of the newest form of churn, a business analyst grabs several terabytes of detailed data drawn from operational applications to get a view of recent customer behaviors. He may mix that data with historic data from a data warehouse. Dozens of queries later, he’s discovered a new churn behavior in a subset of the customer base. With any luck, he’ll turn that information into an analytic model, with which the company can track and predict the new form of churn.
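To make that churn scenario a bit more concrete, here is a minimal sketch in Python with pandas. The tables, column names, and the “low activity, high complaints” segment are hypothetical; the point is only to show the shape of the exploration, not any particular company’s analysis.

```python
import pandas as pd

# Hypothetical extracts: recent behavior pulled from operational systems,
# plus historic context drawn from the data warehouse.
recent = pd.DataFrame({
    "customer_id":       [1, 2, 3, 4, 5],
    "plan":              ["basic", "basic", "premium", "basic", "premium"],
    "logins_30d":        [0, 12, 1, 1, 9],
    "support_calls_30d": [4, 0, 5, 3, 1],
})
history = pd.DataFrame({
    "customer_id":   [1, 2, 3, 4, 5],
    "tenure_months": [26, 3, 48, 14, 31],
    "churned":       [1, 0, 1, 1, 0],
})

detail = recent.merge(history, on="customer_id", how="left")

# One of many exploratory queries: do low-activity, high-complaint customers
# churn at an unusual rate, and does it differ by plan?
segment = detail[(detail["logins_30d"] < 2) & (detail["support_calls_30d"] >= 3)]
print(segment.groupby("plan")["churned"].mean())
```

An analyst would run dozens of variations of that last grouping before a stable churn profile emerges.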
What kind of analytic tool does a business analyst need for the “discovery analytics” that’s common with big data?
Discovery analytics against big data can be enabled by different types of analytic tools, including those based on SQL queries, data mining, statistical analysis, fact clustering, data visualization, natural language processing, text analytics, artificial intelligence, and so on. It’s quite an arsenal of tool types, and savvy users get to know their analytic requirements before deciding which tool type is appropriate to their needs.
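To illustrate just one tool type from that arsenal, here is a small clustering sketch using scikit-learn. The customer features, the synthetic data, and the choice of three clusters are invented for the example; a real project would derive them from actual requirements.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical per-customer features: monthly spend, visits per month, average basket size.
rng = np.random.default_rng(0)
centers = np.array([[60.0, 4.0, 25.0], [220.0, 12.0, 80.0], [15.0, 1.0, 10.0]])
X = np.vstack([rng.normal(c, 5.0, size=(100, 3)) for c in centers])

# Clustering groups similar customers without being told in advance what the groups are.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
for k in range(3):
    members = X[labels == k]
    print(f"cluster {k}: {len(members)} customers, mean monthly spend {members[:, 0].mean():.0f}")
```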
Is big data a problem just to be managed (with its size, diversity, and speed) or is it an opportunity to be seized?
TDWI is currently running an Internet-based survey about big data analytics. An early extraction of survey data shows that only 30% of users responding to the survey are concerned about the technical challenges of collecting and managing big data. The vast majority, namely 70% of respondents, say that big data is definitely an opportunity. That’s because, through analysis, user organizations can discover new facts about their customers, markets, partners, costs, and operations, then use that information for business advantage.
So, what do you think, folks? Let me know. Thanks!
========================================
Don’t miss TDWI’s Big Data Analytics Survey. Please share your opinions and experiences by taking the online survey.
Posted by Philip Russom, Ph.D. on June 21, 2011
Blog by Philip Russom
Research Director for Data Management, TDWI
In prior blogs, I’ve talked about how big data’s primary attribute is data volume. That’s pretty obvious. But it’s defined by other characteristics, too. For example, one of the things that makes big data so big is that it’s coming from a greater variety of sources than ever before. Now let’s look at the last of the three Vs of Big Data Analytics, namely data velocity.
Data Feed Velocity as a defining attribute of Big Data
Big data can be described by its velocity or speed. Or you may prefer to think of it as the frequency of data generation or frequency of data delivery. For example, think of the stream of data coming off of any kind of sensor, say thermometers sensing temperature, microphones listening for movement in a secure area, or video cameras scanning for a specific face in a crowd. This isn’t new; many firms have been collecting click stream data off of Web sites for years, using streaming data to make purchase recommendations to Web visitors. With sensor and Web data flying at you relentlessly in real time, data volumes get big in a hurry. Even more challenging, the analytics that go with streaming data have to make sense of the data and possibly take action—all in real time.
So you don’t think this is all science fiction, allow me to share some of the use cases for high-velocity data feeds and streams that I’ve heard recently.
Here’s an unsubstantiated anecdote that someone told me: “There’s a cell service provider in Japan that collects GPS data from cell phone users. The cell provider collects the data in real time, and keeps track of which people are walking the furthest. Once a month, the cell provider gives an award to the walker who covered the greatest distance. In a way, cell phones are working like sensors to collect and analyze streaming big data.”
I also heard a similar anecdote: “Imagine that I’m a consumer walking around downtown in a city, and I’m shopping. Now imagine letting a shopping service know where I am, plus maybe the kinds of goods I’m looking for. As I walk, the GPS coordinates could stream to the shopping service, and it could point me to stores that match my interests.”
A consultant who specializes in streaming data told me about some video and audio analytic applications he’s looking into: “Think about the algorithms that enable us to parse text and perform sentiment analysis, sometimes in real time. Very similar algorithms can parse video images to document and analyze changes in the thing that’s being imaged. Satellite images could monitor and analyze troop movements, a flood plain, cloud patterns, and grass fires. Or a video analysis system could monitor a sensitive or valuable facility, watching for possible intruders, then alert authorities in real time. You can implement similar applications with sound monitoring; one of my apps involves two thousand underground microphones to listen for movement in geologic formations. I hope it can eventually help predict earthquakes.”
Here’s a related user story about streaming big data that I heard recently: “You don’t need all of the streaming data. You just need the interesting pieces or just the one piece that identifies what you’re looking for. We’ve all seen video footage from the US military’s unmanned jet drones. A drone is processing several frames of video per second looking for shapes or light signatures that match its programming. For example, it might be looking for shapes that look like tanks or sun reflections that could come from metallic weapons. The drone deletes almost all of the frames, because they’re not of interest. And that helps avoid data glut that could choke the system.”
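The “keep only the interesting pieces” idea is easy to sketch in code. The snippet below is a toy Python illustration, not anything drawn from a real drone or surveillance system; the event fields and the matching rule are made up.

```python
def interesting(event):
    """Hypothetical predicate: keep only events whose signature matches what we seek."""
    return event.get("shape") == "tank" or event.get("reflectance", 0.0) > 0.9

def filter_stream(events):
    """Discard the vast majority of the stream; retain only what matters."""
    for event in events:
        if interesting(event):
            yield event

# Tiny demo stream standing in for frames of video metadata.
frames = [
    {"frame": 1, "shape": "tree", "reflectance": 0.1},
    {"frame": 2, "shape": "tank", "reflectance": 0.4},
    {"frame": 3, "shape": "rock", "reflectance": 0.95},
]
for kept in filter_stream(frames):
    print("kept", kept)
```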
A prominent Internet-based business told me a few weeks ago: “We load 200 gigabytes a day into our data warehouse. But that’s processed down from several terabytes of Web log and click-stream data. We mix this big data with data about our customers drawn from other touch points, then analyze it. Although the data is streaming, we collect the stream on disk, then process it down and analyze it over night. Our next step is to process and analyze streaming big data in real time. We’re definitely a customer-oriented business, so understanding customers and serving them better is the goal of analytics. We just need to do it both after the fact in batch and – eventually – in real time.”
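A rough sketch of that “process it down” step might look like the following. The click-stream field layout and the aggregation chosen here are assumptions for illustration, not the company’s actual pipeline.

```python
from collections import Counter

# A few lines standing in for terabytes of raw click-stream data; the field
# layout (timestamp, customer_id, url, status) is assumed for this sketch.
raw_stream = [
    "2011-06-16T00:01:02 c-88 /home 200",
    "2011-06-16T00:01:07 c-88 /cart 200",
    "2011-06-16T00:01:09 c-14 /home 404",
    "2011-06-16T00:02:31 c-14 /home 200",
    "2011-06-16T00:02:40 c-88 /cart 200",
]

# "Process it down": keep only successful requests and reduce the detail to
# per-customer, per-page counts, a far smaller dataset to load and analyze.
daily_views = Counter()
for line in raw_stream:
    _, customer_id, url, status = line.split()
    if status == "200":
        daily_views[(customer_id, url)] += 1

for (customer_id, url), views in sorted(daily_views.items()):
    print(customer_id, url, views)
```

The aggregate is a tiny fraction of the raw stream, which is what makes the nightly load manageable.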
So, what do you think, folks? Let me know. Thanks!
========================================
This blog is number 3 in a series of 3, all about the three Vs of big data analytics, namely data volume, variety, and velocity. You can read the first blog here. And you can read the second blog here.
Don’t miss TDWI’s Big Data Analytics Survey. Please share your opinions and experiences by taking the online survey.
Posted by Philip Russom, Ph.D. on June 17, 2011
Blog by Philip Russom
Research Director for Data Management, TDWI
This blog is number 2 in a series of 3, about the three Vs of big data analytics, namely data volume, variety, and velocity. You can read the first blog here online.
Data Type Variety as a defining attribute of Big Data
One of the things that makes big data big is that it’s coming from a greater variety of sources than ever before. Many of the newer ones are Web sources (logs, click streams, and social media). Sure, user organizations have been collecting Web data for years. But, for most organizations, it’s been a kind of hoarding. We’ve seen similar untapped big data collected and hoarded, such as RFID data from supply chain apps, text data from call center apps, semi-structured data from various insurance processes, and geospatial data in logistics. What’s changed is that far more users are now analyzing big data, instead of merely hoarding it. And the few organizations that have been analyzing it now do so at a more complex and sophisticated level. A related point is that big data isn’t new, but the effective leverage of it for analytics is. (For more on that point, see my blog: The Intersection of Big Data and Advanced Analytics.)
But my real point for this blog is that the recent tapping of these sources means that so-called structured data (which previously held unchallenged hegemony in analytics) is now joined (both figuratively and literally) by unstructured data (text and human language) and semi-structured data (XML, RSS feeds). There’s also data that’s hard to categorize, as it comes from audio, video, and other devices. Plus, multidimensional data can be drawn from a data warehouse to add historic context to big data. I hope you realize that’s a far more eclectic mix of data types than analytics has ever seen (or any discipline within BI, for that matter). So, with big data, variety is just as big as volume. Plus, variety and volume tend to fuel each other.
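To show what mixing these data types can look like in practice, here is a minimal Python sketch that pulls an RSS/XML snippet and a JSON record into one flat shape that an analytic tool could consume. The snippets and field names are invented for the example.

```python
import json
import xml.etree.ElementTree as ET

rss_snippet = """<rss><channel>
  <item><title>Product recall notice</title><pubDate>2011-06-14</pubDate></item>
  <item><title>New store opening</title><pubDate>2011-06-15</pubDate></item>
</channel></rss>"""

json_snippet = '{"source": "call_center", "text": "Customer unhappy with delivery", "date": "2011-06-15"}'

records = []

# Semi-structured XML/RSS: pull out only the fields we care about.
for item in ET.fromstring(rss_snippet).iter("item"):
    records.append({"source": "rss", "text": item.findtext("title"), "date": item.findtext("pubDate")})

# Semi-structured JSON from another system, flattened into the same shape.
doc = json.loads(json_snippet)
records.append({"source": doc["source"], "text": doc["text"], "date": doc["date"]})

for r in records:
    print(r)
```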
To further support the point that big data is about variety, let’s look at Hadoop. I managed to find a couple of users who’ve used Hadoop as an analytic database. Both said the same thing: Hadoop’s scalability for big data volumes is impressive. But the real reason they’re working with Hadoop is its ability to manage a very broad range of data types in its file system, plus process analytic queries via MapReduce across numerous eccentric data types.
Stay tuned for the third and final blog in this series, which will be titled: The Three Vs of Big Data Analytics: VELOCITY.
=============================================
NOTE -- Don’t miss TDWI’s Big Data Analytics Survey. Please share your opinions and experiences by taking this online survey.
Posted by Philip Russom, Ph.D. on June 14, 2011
Blog by Philip Russom
Research Director for Data Management, TDWI
I was recently on a group call along with several other analysts where IBMers spelled out their definition of big data. They structured the definition by explaining big data’s primary attributes, namely data volume, data type variety, and the velocity of streams and other real time data. I don’t necessarily agree with everything the IBMers said, but I must say that the three Vs of big data – volume, variety, and velocity – constitute a more comprehensive definition than I’ve heard elsewhere. In particular, the three Vs bust the myth that big data is only about data volume. Plus, the term “three Vs” is a catchy mnemonic. So I freely admit that I am shamelessly stealing the concept of the three Vs as a structure for my own definition of big data.
Note that IBMers didn’t consistently link big data with advanced analytics – but I will. This blog focuses on data volume, whereas other upcoming blogs will hit data type variety and data stream velocity.
Data Volume as a defining attribute of Big Data
It’s pretty obvious that data volume is the primary attribute of big data. With that in mind, some people have asked me for a definitive number quantifying the volume, a common question being: “Exactly how many terabytes constitute big data?” In some user interviews I’ve conducted lately, users have said that big data used to start at 3 terabytes, but now the bottom threshold is more like 10 terabytes. In a 2010 TDWI Technology Survey, a third of users surveyed said they will have 10 terabytes within three years. So 3 to 10 terabytes seems an accurate baseline – for now.
But there’s a catch. Note that my research isn’t about just any big data; it’s about big data collected specifically for analytics. So the numbers quoted above are only for analytic datasets -- not all BI data stores and certainly not every bit and byte in an enterprise.
Here are some comments from the field that add more attributes to big data quantification. I asked one user how many terabytes he’s managing for analytics, and he said: “I don’t know, because I don’t have to worry about storage. IT provides it generously, and I tap it like crazy.” Another user said: “We don’t count terabytes. We count records. My analytic database for quality assurance alone has 3 billion records. There’s another 3 billion in other analytic databases.”
From this we see that big data is a moving target that’s growing, there are different units for quantifying it, and it varies with scope (e.g., analytics vs BI vs whole enterprise). In future blogs in this series, we’ll see that data variety and velocity are just as important as volume when it comes to defining big data. Please stay tuned for those blogs.
So, what do you think, folks? Let me know. Thanks!
======================================================
Don’t miss TDWI’s Big Data Analytics Survey. Please share your opinions and experiences by taking the survey online: http://bit.ly/jxWh9N
Posted by Philip Russom, Ph.D. on June 9, 2011
Blog by Philip Russom
Research Director for Data Management, TDWI
I recently chatted with Paul Groom, the VP of Business Intelligence at Kognitio. Among other things, Paul had some great tips for moving beyond common barriers to analytics with big data. I’d like to share some of those tips with you.
Philip Russom: I’ve encountered several user companies that are hoarding big data – especially log data from Web sites – but they don’t know how to get started with analyzing it. Are you seeing this, too?
Paul Groom: Yes. I call it “data car parking.” Over time, the data car park gets so big that it’s a psychological barrier to taking any kind of action. For some reason, many data warehouse professionals think they have to process the entire data car park all at once – with the usual ETL, data quality, and data modeling techniques – before analytics can commence. That particular mindset is a show-stopper for big data analytics.
Philip Russom: In data warehousing, we’re taught that transforming, cleansing, and modeling data are requirements, because reports require squeaky clean, auditable data. But analytics and big data have different requirements. Right?
Paul Groom: Right. OLAP aside, most analytic methods require large data samples of highly detailed data drawn straight from operational sources. That’s because a business analyst is trying to discover unknown business facts in previously untapped data, which differs from data warehousing that reports on known business facts based on well-understood data. Careful data preparation is desirable in data warehousing for reports, but it’s actually a problem with analytics, because data prep strips out the details and granularity that analytics depends on. Oddly enough, when users figure out that they should forego most of the data prep they’re used to in data warehousing, it removes a barrier so they can proceed to analytics with the big data they’ve been caching away.
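A tiny, hypothetical example of Paul’s point: summarizing detail strips out exactly the values an analyst is hunting for. The numbers and the flagging rule below are made up.

```python
# Hypothetical daily transaction amounts for one account, straight from the source.
transactions = [102.0, 98.5, 101.2, 99.8, 4850.0, 100.4]   # note the one big spike

# The "prepared" view a report might use: the spike is diluted into an average.
print(f"summarized: monthly average = {sum(transactions) / len(transactions):.2f}")

# The detail-level view an analyst needs: the anomaly is still visible.
baseline = sorted(transactions)[len(transactions) // 2]     # rough median
print("detail flags:", [t for t in transactions if t > 10 * baseline])
```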
Philip Russom: I’ve been talking up the perils of data prep for analytics for about two years now. Even when users get the point, they’re still skeptical about the next step, namely complex analytic queries against non-optimized big data.
Paul Groom: We get a lot of that, too. The skepticism is natural, because data warehouse pros have been using hand-me-down database management systems designed for transaction processing, and these don’t perform well with complex analytic queries. But the newest generation of analytic databases does. Assuming you have one of these, such as Kognitio WX2, then the rules of the analytic game just changed.
Our mantra is: “Trust the database.” A modern analytic database can quickly execute any query you come up with, without need for time-consuming data prep or repetitive tweaking of queries and data models. Once users build confidence in new database performance, it removes another barrier to analytics.
So, what do you think, folks? Let me know. Thanks!
Posted by Philip Russom, Ph.D. on May 31, 2011
Blog by Philip Russom
Research Director for Data Management, TDWI
I just got off the phone with Ellie Fields, the director of product marketing at Tableau Software. Ellie has a lot to say about intersections among big data, analytics, and data visualization. So allow me to recount the high spots of the conversation.
Philip Russom: Tableau is often pigeon-holed as a data visualization vendor. But the Tableau users I’ve met are using the tool for analytics. How does Tableau position itself?
Ellie Fields: Our customers use Tableau in different ways. For example, many use us as their primary, enterprise BI platform. Others use us for specific BI applications within a department. Still other customers use Tableau for fast analytics, as a complement to a legacy BI platform. Given the breadth of use, we see ourselves as a multi-purpose BI platform.
Philip Russom: I’ve seen demonstrations of the Tableau tool, so I know that ease-of-use is high. But is it high enough to enable self-service BI?
Ellie Fields: The Tableau tool was designed with self-service in mind for a broad range of BI users. For example, with a few mouse clicks, a user can access a database, identify data structures of interest, and bring data into server memory for reporting or analysis. The user needs to know the basics of enterprise data, but doesn’t need to wait for assistance from IT. With a few more clicks, you can publish your work for colleagues to use. Going back to your question about positioning, we describe this quick and easy method as “rapid fire business intelligence.”
Philip Russom: What’s the relationship between data visualization and big data?
Ellie Fields: As you know, Tableau is strongly visual. In fact, the visual images representing data are an extension of the user interface, in that you grab your mouse and – with simple drag-and-drop methods – you interact directly with the visualization and other visual controls to form queries, reports, and analyses. Analysis is iterative, and iterations need to flow fast. The drag-and-drop environment enables an analyst to work quickly, without losing the train of thought, and even to collaborate with others on live data. So, we’re fast with results – even against big data.
When working with big data, all of our visualizations scale up and down, in that they can represent ten data points from a spreadsheet or ten million rows of big data. And when working with big data, visualization is even more important. It’s how humans explore and consume information to arrive at a conclusion. Analytics without good visualization is hamstrung from the beginning.
Philip Russom: What types of analytic applications have you seen in your customer base recently?
Ellie Fields: Many of our customers practice what we call “exploratory analytics.” This is especially important with big data, where the point is to explore and discover things you didn’t already know. For example, we have a lot of Web companies as customers, and they depend on advertising for revenue. As they explore big data, they’re answering analytic questions like: “How do small ads compare to big ones? Or which colors in an ad sell the most?” Yahoo! is a customer, and they analyze online ads by many dimensions, including size, color, location, frequency, Web site locations, revenue, and so on.
High tech manufacturing stands out as a growing area, especially analytics for monitoring product and supply quality. Healthcare, finance, and education companies have also adopted Tableau. One healthcare client analyzes its supply chain to be sure all locations are equipped adequately. Another hospital uses analytics to optimize nurse staffing. And a university client analyzes trends in SAT scores to enlighten decisions about recruitment, scholarships, and educational curricula.
So, what do you think, folks? Let me know. Thanks!
Note: The next TDWI Solution Summit, September 25-27 in San Diego, will feature case studies focused on the theme of “Deep Analytics for Big Data.”
Posted by Philip Russom, Ph.D. on May 19, 2011
When you’re 100 years old, as IBM is this year, it would be easy to think that you’ve seen it all. What could possibly be new to Big Blue about “big data”? In the view of Robert LeBlanc, SVP of Middleware Software for the IBM Software Group, quite a bit.
The new problem set, defined by business opportunities opening up due to the availability of new sources of information, cannot be solved with traditional data systems alone. Kicking off the IBM Big Data Symposium for industry analysts at the Yorktown Research Center on May 11, LeBlanc itemized a number of challenges, including multi-channel customer sentiment and experience analysis, detection of life-threatening conditions at hospitals in time to intervene, Medicare fraud interdiction before payment, and weather pattern predictions to optimize wind turbine locations. (Note: The next TDWI Solution Summit, September 25-27 in San Diego, will feature case studies focused on the theme of “Deep Analytics for Big Data.”)
“Big data” is both an evolutionary and revolutionary phenomenon. Given that organizations have been working with large data warehouses and other types of files for some time, it should come as no surprise that the sheer quantity of data would continue to grow. Data is a renewable resource; the more applications and systems that use it, the more data that they tend to generate. Data warehouses will continue to be important, but even as the terabytes of structured data pile up, organizations are hunting down unstructured sources to tap their value and discover new competitive advantages.
IBM’s view of what makes big data revolutionary comes down to the convergence of the three “V’s”: volume, velocity, and variety. Volume is the easiest to understand, although IBM speakers at the Symposium described scenarios where so much data was streaming through in real time that storing it was impossible. Huge data volumes, plus the velocity with which the data is flowing in, are opening up opportunities for technology alternatives, including Hadoop, MapReduce, and event stream processing. Variety, the third “V,” adds in the unstructured and complex data sources growing up on the Web, particularly in social media. Some organizations, of course, do store all this data; Eric Baldeschwieler, VP of Hadoop Development at Yahoo!, described Yahoo!’s use of the Hadoop Distributed File System (HDFS) to store petabytes of data on nodes across its vast array of clusters. “Hadoop is behind everything we do,” he said.
It was not surprising news, but Baldeschwieler and IBM experts gave a full-throated defense of Apache Hadoop and the importance of having open source software at the foundation of big data programs. IBM did not mention EMC explicitly, but it was clear that the company was responding to EMC’s May 9 announcement of the new Greenplum HD Data Computing Appliance, which offers its own distribution of Apache Hadoop. IBM execs warned of the dangers of “forking,” which is what happened when vendors created their own versions of the UNIX operating system and users had to deal with competing standards. Baldeschwieler and IBM execs did acknowledge, however, that Apache Hadoop is far from a finished product, and in any case is not the solution to all problems.
I came away from the Symposium excited by the future of big data analytics but also aware that there’s a long way to go. “Big data” is not about a single technology, such as Hadoop or MapReduce (for more on Hadoop, see my colleague Philip Russom’s interview with the CEO of Cloudera here). These technologies are more of a complement to data warehousing than a replacement for it. Yahoo!’s Baldeschwieler made the point that Yahoo! also has data warehouses. As each industry’s requirements become clearer, vendors such as IBM will assemble packages that bring together the strengths of their existing solutions with new technologies. Then organizations will have a better understanding of how to compare the vendors’ offerings. We’re not quite there yet.
Posted by David Stodder on May 17, 2011
Blog by Philip Russom, Research Director for Data Management, TDWI
I recently had a great phone conversation with Mike Olson, the CEO of Cloudera. Mike has a gift for explaining new and complex technologies and their emerging best practices. Let me share a few of Mike’s insights.
Philip Russom: My understanding is that Cloudera makes a business by distributing open source software, namely MapReduce-based Apache Hadoop. Is that right?
Mike Olson: Well, that’s part of it. Cloudera does a lot more than simply distribute open source Hadoop. We make Hadoop viable for serious enterprise users by also providing technical support, upgrades, administrative tools for Hadoop clusters, professional services, training, and Hadoop certification. Furthermore, our distribution package of Hadoop includes more than Hadoop. So Cloudera collects and develops additional components to strengthen and extend Hadoop.
Philip Russom: So, what is Hadoop?
Mike Olson: Essentially there are two pieces in Hadoop. First, there’s the Hadoop Distributed File System (or HDFS), which can manage big data on clusters of many nodes. Our customers typically start with twenty nodes or so, then quickly grow to fifty or more. Some of our customers have thousands of nodes, managing petabytes of data. A many-node cluster enables big data management, plus other nice benefits like scalability, performance, and high availability. But the ramification is that data is heavily distributed.
That’s where the second piece comes in, namely MapReduce. Thanks to this capability of Hadoop, you can define a data operation--like a query or analysis--and the platform ‘maps’ the operation across all relevant nodes, for distributed processing and data collection. The platform then consolidates and reduces the responses that come back. Due to the distributed processing of MapReduce, analytics against very big data is possible—and with good performance.
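To show the shape of what Mike describes, here is a toy, single-process Python imitation of the map and reduce steps. The log lines and the per-node partitions are invented, and a real Hadoop job would of course distribute the work across the cluster rather than run it in one script.

```python
from collections import defaultdict

# Toy stand-in for log lines spread across many HDFS nodes.
partitions = [
    ["/home 200", "/cart 500", "/home 200"],
    ["/cart 200", "/home 404", "/cart 200"],
]

def map_phase(lines):
    """'Map': run the same extraction on every node's local data."""
    for line in lines:
        url, status = line.split()
        yield url, 1 if status == "200" else 0

def reduce_phase(mapped):
    """'Reduce': consolidate the per-node results into one answer."""
    totals = defaultdict(lambda: [0, 0])          # url -> [ok_count, total_count]
    for url, ok in mapped:
        totals[url][0] += ok
        totals[url][1] += 1
    return {url: ok / total for url, (ok, total) in totals.items()}

mapped = [pair for part in partitions for pair in map_phase(part)]   # runs node by node in Hadoop
print(reduce_phase(mapped))    # e.g. success rate per URL across the whole cluster
```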
Philip Russom: What kind of analytics?
Mike Olson: Hadoop excels in discovering patterns in big data, patterns that you didn’t know were there, in data that you probably don’t know very well. That makes Hadoop the opposite of your average data warehouse query against well-understood relational data. Since Hadoop and a traditional data warehouse are complementary, putting them together gives you a very broad range of business intelligence capabilities.
Philip Russom: What data types and data models are your customers managing?
Mike Olson: In Hadoop, you can mix and match data types to your heart’s content. Hadoop will store anything without requiring a data type declaration. Also, Hadoop is amazingly tolerant of messy data. For example, our customers manage any kind of file you can think of in the HDFS, and these can have just about any kind of data model. This also includes human language text and complex data types. So, big data’s not just big. It’s also highly diverse and complicated. And Hadoop excels in handling data of such extreme size, diversity, and complexity for the purposes of analytics.
So, what do you think, folks? Let me know. Thanks!
Posted by Philip Russom, Ph.D. on May 12, 2011
I’ve recently been interviewing users and business sponsors, asking them about their new practices with advanced analytics, plus the special role of big data. When I ask people to talk about critical factors that make or break their success, they usually come around to a common issue that needs sorting out. It’s the fact that most analytic applications are departmentally focused (often departmentally owned and funded) and they satisfy department requirements, not enterprise ones.
Give me a minute to explain what I’m hearing from users, as well as why big data analytics is progressively a departmental affair:
Analytic applications are departmental, by nature. Just about any analytic application you think of is focused on tasks, data domains, and business opportunities that are associated with specific departments. For example, customer base segmentation should be owned and executed by marketing and sales departments. The actuarial department does risk analysis. The procurement department does supply and supplier analysis.
Most data warehouse (DW) and business intelligence (BI) infrastructure is not designed for advanced analytics. In most organizations, it is, instead, designed and optimized for reporting, performance management, and online analytic processing (OLAP). This enterprise asset is invaluable for “big picture” reports and analyses that span enterprise-wide processes (especially financial ones). And it’s capable of satisfying most departmental requirements for reporting and OLAP. But, in many organizations, the BI/DW infrastructure cannot (and, due to its owners, will not) satisfy departmental requirements for advanced analytics and big data.
Many departments are deploying their own platforms for big data and analytics. They do this when the department has a strong business need for analytics with big data, plus the budget and management sponsorship to back it up. Just think of the many new vendor tools and platforms that have arisen in recent years. Data warehouse appliances, columnar databases, MapReduce, visual discovery tools, and analytic tools for business users all supply analytic functionality that user organizations are demanding at the department level. And all are built from the ground up to manage and operate on big data. Obviously, big data analytics can be implemented on older, more traditional databases and tools, as well.
Put it all together, and this user and vendor activity reveals that big data analytics is progressively a departmental affair, implemented on departmentally owned platforms.
So, what do you think? Does the trend toward departmental big data analytics make sense to you? Let me know. Thanks!
Posted by Philip Russom, Ph.D. on May 10, 2011
I recently started work on a new TDWI Best Practices Report with the working title: Deep Analytics with Big Data. The report is a tad schizophrenic, in that it’s really about two things – big data and analytics – plus how the two have teamed up to create one of the most profound trends in business intelligence (BI) today. Let me share some of the thinking behind the schizophrenia. Please reply to this blog to tell me whether this makes sense or not.
Advanced Analytics
According to a recent TDWI survey, 38% of organizations surveyed are practicing advanced analytics today. But 85% say they’ll do it within 3 years!
Why the rush to advanced analytics? First, change is rampant in business; we’ve been through multiple “economies” in recent years. And analytics helps us discover what changed plus how we should react. Second, there are still many business opportunities to leverage -- even in the recession -- and more will come as we finally crawl out of the recession. To that end, advanced analytics is the best way to discover new customer segments, identify the best suppliers, associate products of affinity, understand sales seasonality, and so on. For these reasons, TDWI has seen an explosion of user organizations implementing analytics in recent years.
But note that user organizations are implementing specific forms of analytics, particularly what is sometimes called advanced analytics. This is a collection of related techniques and tools, usually including predictive analytics, data mining, statistical analysis, and complex SQL. We might also extend the list to cover data visualization, artificial intelligence, natural language processing, and database methods that support analytics.
All these techniques have been around for years, many of them appearing in the 1990s. The thing that’s different now is that far more user organizations are actually using them. That’s because most of these techniques adapt well to very large, multi-terabyte datasets, with minimal data preparation. And that brings us to big data.
Big Data
Big data can be defined simply as multi-terabyte datasets. And this makes sense, given that corporations, government agencies, and other user organizations are generating and retaining more data than ever before. Soon enough, big data will involve petabytes, not terabytes. Yet big data also involves big complexity, namely many diverse data sources (both internal and external), data types (structured, unstructured, semi-structured), and indexing schemes (relational, multidimensional, NoSQL).
Occasionally, I hear a user complain about the problems of storing and managing big data. Much more often, however, I hear people talk about what an extraordinary opportunity big data is. That’s because, for the kinds of discovery and prediction that most advanced analytic techniques enable, big data is truly a treasure trove of information that merits leverage for business advantage. And that brings us to the intersection mentioned in the title of this blog.
Advanced Analytics and Big Data: Why put them together?
Here are a few reasons:
Big data yields gigantic statistical samples. Most tools designed for data mining or statistical analysis tend to be optimized for large datasets. In fact, the general rule is that the larger the data sample, the more accurate are the statistics and other products of the analysis. Instead of mining and statistical tools, I regularly find users generating or hand-coding complex SQL, which parses big data in search of just the right customer segment, churn profile, or excessive operational cost. The newest generation of data visualization tools and in-database analytic functions likewise operate on big data.
Analytic tools and databases can now handle big data. And they can execute big queries and parses in record time. Recent generations of vendor tools and platforms have raised us onto a new plateau of performance that’s very compelling for applications involving big data.
There’s a lot to learn from messy data, as long as it’s big. Most modern tools and techniques for advanced analytics and big data are very tolerant of raw source data, with its transactional schema, non-standard data, and poor-quality data. That’s a good thing, because discovery and predictive analytics depend on lots of details, even questionable data. For example, analytic applications for fraud detection often depend on outliers and non-standard data as indications of fraud. If you apply ETL and DQ processes to big data, as you do for a data warehouse, you’ll strip out the very nuggets that make big data a treasure trove for advanced analytics (as sketched just after this list).
Big data is a special asset that merits leverage. And that’s the real point of Deep Analytics with Big Data. The new technologies and new best practices are fascinating, even mesmerizing. And there’s a certain macho coolness to working with dozens of terabytes. But don’t do it for the technology. Put Big Data and Advanced Analytics together for the new insights they give the business.
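To illustrate the messy-data point above, here is a minimal Python sketch that flags an outlier in raw, uncleansed records. The transactions and the flagging rule are hypothetical; the takeaway is that the anomaly survives only because the data was left in its raw, detailed form.

```python
# Hypothetical raw transaction records, deliberately left messy: no cleansing,
# no standardization, because the oddities are exactly what analysis looks for.
transactions = [
    {"id": 1, "amount": 42.10,  "country": "US"},
    {"id": 2, "amount": 39.99,  "country": "us "},    # non-standard value, kept as-is
    {"id": 3, "amount": 9400.0, "country": "US"},     # extreme outlier
    {"id": 4, "amount": 51.25,  "country": None},     # missing value, kept as-is
    {"id": 5, "amount": 47.80,  "country": "US"},
]

amounts = sorted(t["amount"] for t in transactions)
median = amounts[len(amounts) // 2]                   # robust baseline for "normal"

# Flag anything wildly above the norm; a cleansed, aggregated store might have
# "fixed" or summarized this record away before anyone could look at it.
flagged = [t for t in transactions if t["amount"] > 10 * median]
print("possible fraud candidates:", flagged)
```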
So, what do you think? Does the intersection of Big Data and Advanced Analytics make sense to you? Let me know. Thanks!
To learn more, register to attend a TDWI Webinar on this topic: “The Intersection of Big Data and Analytics,” May 5, 2011, at noon Eastern time. http://bit.ly/eh5YA9
Posted by Philip Russom, Ph.D. on April 25, 2011