There's No Such Thing as Unstructured Data
Just because information isn't stored in a database field doesn't mean it lacks structure.
By Bob Potter, Senior Vice President and General Manager, Rocket Software
"Unstructured data" is a term used frequently in a wide variety of technology contexts. Although it certainly serves a purpose, how well does it stand up to closer examination? It's usually used to describe information that either does not have a predefined data model or lacks a structure that is easy for traditional software applications to access and understand. Text-heavy documents, such as PDFs, might be considered typical examples of unstructured data, and prevailing wisdom would argue that 80 to 90 percent of the world's data is unstructured.
It doesn't take a computer science degree to know just how gigantic that volume of data is. Scientists who analyze information growth and storage say that 90 percent of the world's data has been generated in the last two years, and by the year 2020 there will be 40 zettabytes of data in existence (a zettabyte is 10^21 bytes). Gartner defines big data as the increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources) of data.
Data currently defined as unstructured is the type that flies across the Internet -- currently a pipe with a capacity of 667 exabytes annually. Audio and video data types simultaneously take up massive amounts of space and are being generated at the fastest pace among data types. Database management systems struggle to keep up with the pace of data creation, and, if most of this information is unstructured, then the data management battle becomes even more difficult.
Is unstructured data really unstructured? Information that isn't stored in a field in a relational database, accessed through the Structured Query Language (SQL) that relational database vendors popularized over 40 years ago, doesn't necessarily lack structure. In fact, data that lacks the rigid structure dictated by SQL architectures is actually more flexible.
Tagging data and other markers used by languages such as XML and HTML provides enough information for modern software applications to fetch the data elements a user needs. Whatever meaning is lost can be recovered through search engines that index and search through non-relational data types and semantic processors that determine meaning through increasingly sophisticated models. Search has become the new SQL, and almost every contemporary application, whether enterprise or consumer, incorporates some version of newer search capabilities.
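To make the point about tags concrete, here is a minimal Python sketch of pulling data elements out of a tagged document using only the standard library. The memo, its tag names, and the `fetch_elements` helper are invented for illustration; they are assumptions, not part of any product mentioned in this article.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: mostly free text, with a few tagged elements.
DOC = """<memo>
  <author>J. Smith</author>
  <date>2014-03-01</date>
  <body>Quarterly sales rose in the <region>Northeast</region> region.</body>
</memo>"""


def fetch_elements(xml_text, tag):
    """Return the text of every element with the given tag, at any depth."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter(tag)]


authors = fetch_elements(DOC, "author")
regions = fetch_elements(DOC, "region")
print(authors, regions)
```

No schema or table definition is declared anywhere, yet the author and the region can still be addressed by name. The tags themselves carry the structure.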
JSON has been popularized by Web services developed utilizing REST principles. JSON and REST are used by virtually all modern application developers who desire flexible architectures and wish to include all the data a user may want to access and work with in their applications. More often than not, these new applications are hosted in the cloud and are service based.
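As a rough illustration of why JSON payloads are hard to call unstructured, the sketch below parses a made-up response body and walks its nested fields. The payload, its field names, and the `order_total` helper are assumptions chosen for the example; a real service would define its own.

```python
import json

# Hypothetical response body from a REST-style service.
RESPONSE = """
{
  "customer": {
    "name": "Acme",
    "orders": [
      {"id": 1, "total": 250.0},
      {"id": 2, "total": 99.5}
    ]
  }
}
"""


def order_total(payload):
    """Sum the 'total' field across all orders in the payload."""
    data = json.loads(payload)
    return sum(order["total"] for order in data["customer"]["orders"])


print(order_total(RESPONSE))
```

The document nests a list of orders inside a customer record without any predefined table layout, and the application can still navigate it by key.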
Most organizations today rely on hybrid data architectures involving a myriad of database and file management systems: SQL, NoSQL, NewSQL, MapReduce, MultiValue, hierarchical, grid, and more. Data is stored and managed in a variety of locations: in memory, on disk, or in the cloud. Data management and analytics software companies are growing twice as quickly as the software industry as a whole because more data is being generated and more people want to get insight from it.
Search technology continues to evolve to meet these demands. Business users and consumers expect to type in any sequence of words that are meaningful to them and find exactly the information they are seeking. Search engine vendors are developing technology that scales to multi-petabytes and connects to virtually any data source regardless of the structure (or lack thereof).
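The basic idea behind keyword search over free text can be sketched with a toy inverted index, which maps each word to the documents that contain it. This is a deliberately simplified illustration, not how any particular vendor's engine works; the sample documents and the `build_index` and `search` helpers are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return matches


# Hypothetical free-text sources of different types.
DOCS = {
    "memo.txt": "Quarterly sales rose in the Northeast region.",
    "notes.txt": "Sales call with the Northeast team went well.",
    "faq.html": "How do I reset my password?",
}

index = build_index(DOCS)
print(search(index, "northeast sales"))
```

Real engines add ranking, stemming, and semantic models on top, but even this small index lets a user type words that are meaningful to them and find the matching documents, whatever format those documents started in.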
The next time you hear someone use the phrase "unstructured data," politely say, "There is no such thing." They will thank you later.
Bob Potter is senior vice president and general manager of Rocket Software's business information/analytics business unit. He has spent 33 years in the software industry with start-ups and mid-size and large public companies with a focus on BI and data analytics. You can contact the author at [email protected].