Data Fabrics for Big Data
Providing a single platform and a single point for data access across multiple siloed systems helps enterprises struggling with diverse data.
- By Brian J. Dooley
- June 20, 2018
The idea of a fabric connecting computing resources and providing centralized access has been around since the early concepts of grid computing in the 1990s. Fabrics are interconnected structures where multiple nodes appear as a single logical unit. A data fabric is a more recent idea using the same concept, but associated with data rather than with systems. The most recent iteration is "big data fabric," as envisioned by analyst firm Forrester in 2016.
Data fabric concepts have become more important as data sources have diversified. Integrating data is a persistent problem because data from diverse operations is often held in discrete silos. Enterprises need to bring together data from transactional data stores, data warehouses, data lakes, machine logs, unstructured data sources, application storage, social media storage, and cloud storage. Silos are proliferating, particularly with increasing cloud storage and the Internet of Things (IoT). Management, security, reliability, and consistency must be maintained even as data democratization and machine learning add further complexity.
The term big data fabric is loosely defined at present, representing a need rather than a specific solution. A big data fabric is a system that provides seamless, real-time integration and access across the multiple data silos of a big data system. Many of the products labeled specifically as big data fabrics focus on Hadoop, though integration with non-Hadoop storage is equally important. The major vendors are at the forefront, but there are many start-ups with unique offerings, and we can expect that new solutions will emerge to provide efficient and complete data access for specific industries.
Big Data Changes Everything
Transactional database storage was governed by specific processes that ensured accessibility, security, deduplication, accuracy, and field mapping, but the increasing use of unstructured data and data lakes has created significant problems for data management.
The requirements for ensuring accuracy and usability have remained constant, but the ability to manage them has been diminished by increasing variety, velocity, access requirements, and sheer volume. Companies have attempted to handle these issues in a number of ways, including creating solutions specific to individual data silos, loosely federating different silos and connecting them on an application-to-application basis, and supporting virtualization and other techniques.
However, the need for a centralized data access system persists because a single version of the truth -- or at the very least, only a few compatible versions of it -- must always prevail. Big data also adds to concerns around data discovery and security that can only be addressed through a single access mechanism.
To succeed with big data, enterprises need to access data from diverse systems in real time in a digestible format, whether from IoT logs and instrumentation, unstructured voice or image data, structured records, or information stored on peripheral devices. Connecting devices such as smartphones and mobile systems also increases storage access requirements and management issues because data stores may be required at any time to feed real-time information into specific queries. Big data storage today is generally in Hadoop, Apache Spark, NoSQL databases, and other, more recent formats that have special management demands.
Versatility Is Key
The most capable providers of big data fabric solutions are generally the large data and analytics vendors that can provide access to all types of data and bring them into a single consolidated system. A unified data access portal can be provided through a mixture of data movement, data virtualization, and application-to-application connections. Virtualized data storage makes it possible to include data that either cannot easily be transferred or must be prepared and used in real time.
The consolidated system -- the big data fabric -- must address security, handle widely diverse data stores, provide consistent management through unified APIs and software access, provide flexibility across a wide range of implementation requirements, be upgradeable, provide auditability, and automate processes for ingestion, curation, and data integration.
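The single-access-point idea above can be sketched as a thin facade that routes queries to pluggable per-silo connectors. This is a minimal illustration, not any vendor's API: the names (`DataFabric`, `register`, `query_all`) and the stand-in connectors are hypothetical.

```python
# Minimal sketch of a data fabric facade: one logical access point
# routing queries to pluggable connectors, one per silo. All names
# here are illustrative assumptions, not a real product's interface.
from typing import Any, Callable, Dict, List


class DataFabric:
    """Presents many siloed systems as a single logical query surface."""

    def __init__(self) -> None:
        self._connectors: Dict[str, Callable[[str], List[Any]]] = {}

    def register(self, silo: str, query_fn: Callable[[str], List[Any]]) -> None:
        # Each connector hides the silo's native access mechanism
        # (SQL driver, REST call, HDFS read, etc.) behind a callable.
        self._connectors[silo] = query_fn

    def query(self, silo: str, query: str) -> List[Any]:
        # Direct a query at one named silo.
        if silo not in self._connectors:
            raise KeyError(f"No connector registered for silo '{silo}'")
        return self._connectors[silo](query)

    def query_all(self, query: str) -> Dict[str, List[Any]]:
        # Fan the same logical query out to every silo and collect results,
        # giving the caller one consolidated view.
        return {name: fn(query) for name, fn in self._connectors.items()}


# Usage: stand-in connectors for a warehouse and a data lake.
fabric = DataFabric()
fabric.register("warehouse", lambda q: [("orders", 120)])
fabric.register("data_lake", lambda q: [("clickstream", 98000)])
results = fabric.query_all("count rows")
```

In a real fabric the connectors would also enforce security policy, translate queries into each store's dialect, and decide between moving data and virtualizing it; the facade pattern is what keeps those concerns invisible to the caller.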
Enterprises have been struggling with integration issues ever since storage departed from the standard SQL-based data warehousing environment. Big data is essentially "the straw that broke the camel's back," as it has brought out an enormous range of new and unstructured data types. Volume and velocity have continued to increase, making the need for instant and reliable access to data of all types a priority.
AI and Future Fabric
Movement into machine learning and artificial intelligence will increase requirements for enormous data stores that become the basis for model training and operations. Providing a single platform and a single point for data access also reduces the complexity of the system from the user's point of view, making it easier to use stored data. Although this doesn't solve the skills shortage, it does make it easier for data scientists to focus on the details of problem-solving rather than the intricacies of erratic data access.
Brian J. Dooley is an author, analyst, and journalist with more than 30 years' experience in analyzing and writing about trends in IT. He has written six books, numerous user manuals, hundreds of reports, and more than 1,000 magazine features. You can contact the author at [email protected].