The 3 Vs and Unstructured Data Analytics
Becoming more efficient in all aspects of unstructured data management is key to your analytics programs.
By Krishna Subramanian
October 20, 2022
Enterprise data volumes are going through the roof, consuming a significant portion of the IT budget. More than half of IT leaders report that their organizations manage 5PB or more of data, and most (68 percent) spend more than 30 percent of their IT budget on data storage, backups, and disaster recovery, according to the Komprise 2022 State of Unstructured Data Management, a third-party survey conducted earlier this summer and sponsored by my company.
Five petabytes is a lot of data (about 1.25 billion digital photos’ worth, for example) and much of it is unstructured, meaning that it doesn’t fit neatly into rows and columns in a database. This unstructured data -- such as log files, IoT sensor data, microscopic data, application data, user documents, manufacturing test data, and medical images -- is an untapped gold mine for primary research and analysis.
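As a quick check on that comparison, the arithmetic can be sketched as follows. The 4 MB average photo size is an assumption chosen for illustration (it is the size that reproduces the 1.25 billion figure); the survey does not state one, and decimal units are assumed.

```python
PETABYTE = 10**15  # decimal petabyte, in bytes
MEGABYTE = 10**6   # decimal megabyte, in bytes

total_bytes = 5 * PETABYTE
photo_bytes = 4 * MEGABYTE  # assumed average photo size, not a survey figure

photos = total_bytes // photo_bytes
print(f"{photos:,} photos")  # 1,250,000,000 photos
```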
Traditionally, data analytics has relied on data warehouses and mining structured and semi-structured data from spreadsheets or financial documents. Yet this data is just the tip of the iceberg, considering that an estimated 80 percent or more of global data is unstructured. With advances in cloud computing, machine learning (ML), and AI tools, unstructured data analytics is now a prime opportunity. ML algorithms depend upon large quantities of data and the cloud delivers a wide array of on-demand, compute-intensive services to affordably run big data infrastructure and analysis like never before.
Today there are a multitude of cloud-based ML and AI services for different use cases -- from image and audio pattern recognition to personally identifiable information (PII) identification. Some interesting and valuable use cases for unstructured data analytics include medical insurance fraud detection, autonomous vehicle testing, malicious actor detection, precision medicine, and customer sentiment analysis of call center audio files.
The 3 Vs Revisited
The Komprise survey showed that 65 percent of organizations plan to invest, or are already investing, in delivering unstructured data to their new analytics/big data platforms. To be successful in unstructured data analytics, you must clear several hurdles that don't arise in the relatively straightforward process of mining structured data in databases and spreadsheets. Doug Laney, then an analyst at META Group (which Gartner later acquired), introduced the 3Vs concept in a 2001 research publication, 3D Data Management: Controlling Data Volume, Velocity, and Variety. When it comes to unstructured data, the 3Vs pose these challenges:
Volume of data. Because there is so much data in organizations today, you can’t feasibly analyze it or copy it all to a cloud service or big data platform. It takes too long and is too expensive unless you can cull the data down to the specific data sets required by data scientists or the files needed by researchers. Efficiently finding the right unstructured data across on-premises, edge, and cloud silos and then moving it to an analytics tool is one of the biggest hurdles today. Adding to this pain is the prevalence of duplicate data. A research group may have teams of people working on the same data set, so multiple copies exist across different file shares and geographic locations.
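To make the duplicate-data problem concrete, here is a minimal sketch of one common way to spot identical copies across file shares: group files by size, then confirm matches with a content hash. The function names are hypothetical, and this is an illustration under simple assumptions (local or mounted paths, files small enough to hash), not a description of how any particular data management product works.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(roots: list[Path]) -> list[list[Path]]:
    """Group files under the given roots that have identical content."""
    # Cheap first pass: only files of equal size can be duplicates.
    by_size: dict[int, list[Path]] = defaultdict(list)
    for root in roots:
        for path in root.rglob("*"):
            if path.is_file():
                by_size[path.stat().st_size].append(path)

    # Expensive second pass: hash only files whose size collides.
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for paths in by_size.values():
        if len(paths) > 1:
            for path in paths:
                by_digest[file_digest(path)].append(path)

    return [sorted(group) for group in by_digest.values() if len(group) > 1]
```

Calling `find_duplicates([Path("/mnt/share_a"), Path("/mnt/share_b")])` (placeholder mount points) would return one list per set of identical files. The size pre-filter matters at petabyte scale because it avoids reading files that cannot possibly match.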
Variety of data. Unlike structured data, unstructured data encompasses many different file types across video, audio, logs, lab notebooks, IoT, and documents. Thus, understanding which types of files match which analytics or cloud service is imperative so you’re always using the right tool for the right job. For example, looking for PII in documents is entirely different from finding all images that contain dogs. Different analytics techniques are needed to process different types of unstructured data.
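One simple way to picture this matching of file types to tools is a lookup table keyed by file extension. The sketch below is illustrative only: the extensions and category names are hypothetical placeholders, not identifiers of real cloud services, and a production system would likely inspect content rather than trust extensions.

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical mapping from file extension to an analytics category.
# The category names are placeholders, not real service identifiers.
ROUTES = {
    ".pdf": "document-pii-scan",
    ".docx": "document-pii-scan",
    ".jpg": "image-recognition",
    ".png": "image-recognition",
    ".wav": "audio-transcription",
    ".mp3": "audio-transcription",
    ".log": "log-analytics",
}


def route_files(paths):
    """Group file names by the analytics category their extension maps to."""
    routed = defaultdict(list)
    for p in map(Path, paths):
        category = ROUTES.get(p.suffix.lower(), "unclassified")
        routed[category].append(p.name)
    return dict(routed)
```

Files with an unknown or missing extension land in an "unclassified" group, so nothing is silently sent to the wrong tool.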
Velocity of data. Data is piling up fast, and because of its speed and volume, you often can’t act quickly enough to place unstructured data into the appropriate storage technology or data lake for analysis. What comes to mind is the iconic “I Love Lucy” episode where Lucy and Ethel fail at their candy factory job once the conveyor belt speeds up, leaving no time to wrap the chocolates and resulting in plenty of waste. Businesses need automation to manage unstructured data because it is impossible to manually handle the velocity, variety, and volume of this data.
Wrangling Unstructured Data
Addressing the challenges of unstructured data volume, variety, and velocity begins with real-time knowledge -- specifically, knowledge of key data characteristics and life cycle as well as knowledge of cloud infrastructure and the big data analytics ecosystem across data centers, the edge, and clouds.
Adopting ways to be more efficient in all aspects of unstructured data management is key to analytics programs. Tactics may include:
- Preprocessing data at the edge so it can be analyzed and tagged with new metadata before it moves into a cloud data lake; this can drastically reduce the wasted cost and effort of moving and storing useless data and can minimize the occurrence of data swamps.
- Applying automation to facilitate data segmentation, cleansing, search, enrichment via tagging, deletion or tiering of cold data by policy, and movement into the optimal storage where it can be ingested by big data and ML tools. The Komprise survey found that the leading new approach to unstructured data management is the ability to initiate and execute data workflows.
- Adopting a data management tool that persists metadata tags as data moves from one location to another. For instance, files tagged as containing PII by a third-party ML service should retain those tags indefinitely so that a new research team doesn’t have to run the same analysis over again -- at high cost.
- Planning appropriately for large-scale data migration efforts with thorough diligence and testing that can circumvent common networking and security issues that often derail the timely completion of moving data from one place to another.
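Two of the tactics above, tiering cold data by policy and keeping metadata tags attached as data moves, can be sketched together. This is a minimal illustration under stated assumptions: the `FileRecord` type, the "hot" and "archive" tier names, and the one-year threshold are all hypothetical, and the code models the bookkeeping only, not the actual movement of files.

```python
from dataclasses import dataclass, field, replace
from datetime import datetime, timedelta


@dataclass(frozen=True)
class FileRecord:
    path: str
    tier: str  # hypothetical tier names: "hot" or "archive"
    last_accessed: datetime
    tags: frozenset = field(default_factory=frozenset)


def apply_tiering_policy(records, now, cold_after=timedelta(days=365)):
    """Return new records with cold files reassigned to the archive tier."""
    result = []
    for rec in records:
        is_cold = now - rec.last_accessed > cold_after
        if is_cold and rec.tier != "archive":
            # replace() copies every other field, so tags travel with the file.
            rec = replace(rec, tier="archive")
        result.append(rec)
    return result
```

Because the tags live on the record rather than on the storage location, a file tagged as containing PII keeps that tag after it is tiered, and a later team can query the tag instead of rerunning the analysis.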
Data scientists, analysts, and researchers spend a large proportion of their time trying to find and prepare data for analysis. Data storage teams have historically been tasked with efficiently storing and protecting data and ensuring that storage delivers high performance and capacity. Now the role has expanded to include partnering with IT infrastructure teams to facilitate the movement of the right unstructured data to the right platform so analysts and researchers can spend more time using this data to unearth new insights and innovations.
A Final Word
In this economy, speed is a game changer. The faster you can feed quality data into your analytics platform, the faster you’ll get results and outcomes, and the less time you’ll spend doing it. Storage management and data management have finally converged, thanks to the demand for unstructured data analytics.
Krishna Subramanian is COO, president, and co-founder of Komprise. In her career, Subramanian has built three successful venture-backed IT businesses and was named a “2021 Top 100 Women of Influence” by Silicon Valley Business Journal. You can reach the author via email, Twitter, or LinkedIn.