TDWI Upside - Where Data Means Business

Small Is Beautiful: The Value of Structured Data

Although big data continues to grow in importance, it may be simple, structured data that will give you all the answers you need.

It will not surprise you to learn that as I write this, #bigdata is being tweeted an average of 1,400 times each hour according to hashtags.org. Big data's power to elucidate user behavior is more or less the holiest of economic grails -- and has been for some time. Every enterprise wants to better understand its consumers, and all the fragmented, unstructured information consumers leave in their digital wake holds the promise of this understanding.

The trouble is, big data is notoriously difficult to wrangle because of its size and complexity. Setting aside for the moment that many enterprises have to purchase access to big data they don't produce themselves, the process of grooming that data for reporting and analysis can be prohibitively expensive. So enterprises of all sizes and creeds end up asking themselves: Do we need big data? If not, will we need it someday soon? If we don't prepare ourselves for it, what do we stand to lose when it becomes an imperative?

What Exactly Is Big Data?

Our first step in answering these questions is to disambiguate big data, which has taken on a variety of meanings over the years. Let's start with a concrete definition: big data is a mass of information characterized by its high volume, velocity, and variety. High volume means there's a lot of data, high velocity means there is more of it all the time, and high variety means it comes in lots of different formats -- not just your standard strings, integers, and dates but also geospatial data, audio and video media, three-dimensional arrays, and more.

Recent hype surrounding big data has diluted this basic definition and stretched it to encompass a number of other data-related processes. For the purposes of this article, keep in mind what big data is not.

  • Big data is not business intelligence. Business intelligence (BI) tools can be used to analyze and produce reports on big data, but having big data is not the same thing as having a means of analyzing it.

  • Big data is not only digital. Although the Internet is largely responsible for the proliferation of big data, it can come from traditional sources as well.

  • Big data is not just data from outside your company. Enterprises can and do generate their own big data using applications, tracking systems, and devices of their own creation and/or implementation.

  • Big data is not AI. However, the two go hand-in-hand. Big data is complex enough to "teach" artificial intelligence algorithms how to look for patterns and predict outcomes based on existing information, but having big data isn't the same thing as having an AI to analyze it.

Understanding Unstructured Data

At its core, big data really is just a tremendous amount of information. However, because of its high volume, velocity, and variety, big data doesn't fit neatly into the tables that make up relational databases. As a result, a lot of big data is collected as key-value pairs instead. Compare this tidy example of a traditional data table, of the kind you might find in an RDBMS such as MySQL, with the key-value pairs below it.

Structured Relational Data

UserID   Platform   Color          Beverage
12345    Facebook   Red            White Wine
23456    Google+    Totally Teal   Dry Martini with a Twist

Unstructured Big Data

<FacebookUser12345_Color, “Red”>
<Google+User23456_Beverage, “Dry Martini with a Twist”>
<FacebookUser12345_Beverage, “White Wine”>
<Google+User23456_Color, “Totally Teal”>

Whereas the structured data values all exist in the same table and are stored on the same server, the key-value pairs exist in no particular order and bear no inherent relation to each other. They can even be stored on different machines! Messiness is the price we pay for unstructured data's flexibility and potential. Unstructured data is typically stored in nonrelational systems such as MongoDB and Hadoop, but to generate reports from it, BI solutions need some sort of organizing layer. Apache Hive is one such layer. It's a data warehouse infrastructure with a SQL-like interface that sits on top of Hadoop and allows BI applications to query the data through a standard interface such as ODBC. Because of all these layers, making sense of unstructured data can be a real challenge.
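To make the contrast concrete, here is a minimal Python sketch of the kind of work an organizing layer has to do: it gathers the four loose key-value pairs from the example back into one row per user. The key layout (platform, then "User", then the ID, then an underscore and the attribute name) is an assumption read off the sample keys, and the names `KEY_PATTERN` and `pivot` are hypothetical, not part of any real product.

```python
import re
from collections import defaultdict

# The four key-value pairs from the example, in arbitrary order.
pairs = [
    ("FacebookUser12345_Color", "Red"),
    ("Google+User23456_Beverage", "Dry Martini with a Twist"),
    ("FacebookUser12345_Beverage", "White Wine"),
    ("Google+User23456_Color", "Totally Teal"),
]

# Assumed key layout: <Platform>User<UserID>_<Attribute>
KEY_PATTERN = re.compile(
    r"^(?P<platform>.+?)User(?P<user_id>\d+)_(?P<attribute>\w+)$"
)


def pivot(pairs):
    """Group loose key-value pairs into one record per user."""
    rows = defaultdict(dict)
    for key, value in pairs:
        match = KEY_PATTERN.match(key)
        if match is None:
            continue  # skip keys that do not follow the assumed layout
        row = rows[match["user_id"]]
        row["UserID"] = match["user_id"]
        row["Platform"] = match["platform"]
        row[match["attribute"]] = value
    # Sort by user ID so the output order is predictable.
    return [rows[user_id] for user_id in sorted(rows)]


table = pivot(pairs)
for row in table:
    print(row)
```

Even in this toy case, the structure has to be inferred from naming conventions in the keys. Real unstructured data rarely follows one convention so consistently, which is a large part of why grooming it is expensive.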

Learning to organize and manage your "small data," often referred to as operational data, can help inform future forays into big data.

The Intricacies of Small Data

What is now considered "small data" used to just be called data. The term was coined to distinguish structured data from big data, and now it carries the stigma of being humdrum and outmoded, at least from a media standpoint.

Though it may be more structured than big data, small data is far from simple. Structured data begins its life cycle as what's called transactional data or denormalized data, and it has to go through a normalization process (a part of ETL) to become reportable. The transformation part of this process includes such steps as eliminating data redundancy, translating coded values, joining tables, and cleaning up user errors.

Transactional data might arrive looking like this:

UserID   UserName            UserGender   SelectedColor
000023   Amy Jones           1            Red
000045   Adams, Jerome       2            Red
003453   James Alvin Avery   2            BLU

and needs to be transformed into this for reporting purposes:

UserID   UserFirstName   UserLastName   UserGender   SelectedColor
23       Amy             Jones          F            Red
45       Jerome          Adams          M            Red
3453     James           Avery          M            Blue

Note that the users' first names have been separated from their last names, differences in input format have been corrected, and the numeric values corresponding to each user's gender have been replaced with string values. The leading zeros have also been stripped from the user IDs, and the abbreviated color code BLU has been spelled out as Blue. This is a simplified example of a process that, for some enterprises, must be repeated for thousands of tables containing hundreds of rows and being accessed by hundreds of tenant groups with different needs and permissions. It's also important to strike a balance between normalized and denormalized tables for reporting purposes because the more normalized a data set is, the more tables it has and the more unwieldy it becomes.
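The transformation step for this example could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a production ETL job: the lookup tables `GENDER_CODES` and `COLOR_CODES` are inferred from the three sample rows, and the helper names `split_name` and `transform` are hypothetical.

```python
# Raw transactional rows, copied from the first table in the example.
raw_rows = [
    {"UserID": "000023", "UserName": "Amy Jones",
     "UserGender": "1", "SelectedColor": "Red"},
    {"UserID": "000045", "UserName": "Adams, Jerome",
     "UserGender": "2", "SelectedColor": "Red"},
    {"UserID": "003453", "UserName": "James Alvin Avery",
     "UserGender": "2", "SelectedColor": "BLU"},
]

# Lookup tables inferred from the example (assumptions); a real system
# would maintain its own code mappings.
GENDER_CODES = {"1": "F", "2": "M"}
COLOR_CODES = {"BLU": "Blue"}


def split_name(user_name):
    """Return (first, last) for 'Last, First' or 'First [Middle] Last'."""
    if "," in user_name:
        last, first = (part.strip() for part in user_name.split(",", 1))
        return first, last
    parts = user_name.split()
    return parts[0], parts[-1]  # middle names are dropped, as in the example


def transform(row):
    """Turn one transactional row into its reportable form."""
    first, last = split_name(row["UserName"])
    color = row["SelectedColor"]
    return {
        "UserID": int(row["UserID"]),  # int() strips the leading zeros
        "UserFirstName": first,
        "UserLastName": last,
        "UserGender": GENDER_CODES[row["UserGender"]],  # translate coded value
        "SelectedColor": COLOR_CODES.get(color, color),  # expand known codes
    }


report_rows = [transform(row) for row in raw_rows]
for row in report_rows:
    print(row)
```

Each rule here encodes a decision someone had to make about the data, such as what to do with middle names or unknown gender codes. Multiply those decisions across thousands of tables and the effort involved in "small" data becomes clear.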

Priming the data so that it behaves the way you need it to can be a profound learning experience, as can analyzing that data. Building reports and visualizations often reveals where your ETL process could use improvement.

The Right Data, Big or Small

The most important thing for you to know is what kinds of questions your company and its competitors are asking. Are they asking questions easily answered by their operational data or are they in search of information they don't yet have?

Maxwell Wessel, the general manager of SAP.io, observes that "most companies spend too much time at the altar of big data" when the small data they already have holds the answers to their questions. What enterprises need to do is stay in tune with their respective industries and practice separating the signal from the noise. When a critical mass of people start asking questions that are truly unanswerable without big data -- when big data is also the right data -- it is time to invest in harnessing unstructured information.

In the meantime, unfettered access to structured, operational data is a great place to start, especially if it's your first foray into business analytics. There's a great deal to be learned in the process, which can help you prepare for the challenge of big data to come.
