The Role of Ontologies within Unified Data Models
Before tackling the complexity of disparate data sources, you need to understand how semantic abstraction layers can save you from a world of pain.
Data management professionals know that how you model your data directly constrains how flexibly you can analyze it.
When accessing analytics in the cloud, users want to tame the complexity of disparate data sources. When you consolidate relational sources that embody divergent data schemas and definitions, you are inviting a world of pain. You can't roll up those sources for unified drilldown until you have run them through a gauntlet of data integration, matching, merging, and cleansing.
Even then, the potential for data to become incoherent is always present. Typically, one must make the resultant data set available in relational third-normal form. Querying across multistructured sources might involve transforming nonrelational data to relational schemas that support SQL access. It might even involve keeping data in its source formats and offering agile query access through an abstraction that can do justice to the myriad semantics. However, that doesn't always ensure that the full original context of data can survive its convoluted transformations.
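To make the nonrelational-to-relational step concrete, here is a minimal sketch in Python using only the standard library. The documents, table names, and column names are hypothetical and chosen purely for illustration; a production pipeline would involve far more validation and cleansing.

```python
import json
import sqlite3

# Hypothetical nested documents, e.g. exported from a document store.
raw_docs = [
    '{"id": 1, "customer": {"name": "Acme", "region": "EMEA"},'
    ' "orders": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}',
    '{"id": 2, "customer": {"name": "Globex", "region": "APAC"},'
    ' "orders": [{"sku": "A1", "qty": 5}]}',
]


def load_documents(conn, docs):
    """Flatten each nested document into two relational tables."""
    conn.execute(
        "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, region TEXT)"
    )
    conn.execute(
        "CREATE TABLE order_line ("
        "customer_id INTEGER REFERENCES customer(id), sku TEXT, qty INTEGER)"
    )
    for text in docs:
        doc = json.loads(text)
        conn.execute(
            "INSERT INTO customer VALUES (?, ?, ?)",
            (doc["id"], doc["customer"]["name"], doc["customer"]["region"]),
        )
        # The nested order array becomes one row per order line.
        conn.executemany(
            "INSERT INTO order_line VALUES (?, ?, ?)",
            [(doc["id"], o["sku"], o["qty"]) for o in doc["orders"]],
        )
    conn.commit()


def qty_by_region(conn):
    """Plain SQL rollup that is now possible over the flattened data."""
    rows = conn.execute(
        "SELECT c.region, SUM(o.qty) FROM customer c "
        "JOIN order_line o ON o.customer_id = c.id "
        "GROUP BY c.region ORDER BY c.region"
    ).fetchall()
    return dict(rows)


conn = sqlite3.connect(":memory:")
load_documents(conn, raw_docs)
totals = qty_by_region(conn)
```

Notice that the flattened rows no longer record which source system a value came from or how it was nested, which is exactly the kind of context that can be lost along the way.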
That's where unified data models can save the day. These semantic abstraction layers ensure that the data assets being consumed have these characteristics:
- Consolidated: Consolidation usually entails bringing together all relevant data into a physically and/or logically integrated data repository. This repository may be a data lake, data warehouse, or another cloud database that has been optimized for analytics, or it may even be a distributed data fabric.
- Cleansed: Data cleansing requires transformation, matching, merging, correcting, and enhancing of all data prior to loading into repositories. Cloud providers often deliver these capabilities through data profiling, data cleansing, data augmentation, and master data management services.
- Current: Having up-to-date data may require accelerating the extraction, preparation, and delivery of data from source applications to business intelligence, reporting, and other consuming applications. To make this happen, the data platform vendor may provide distributed caching, event stream processing, and in-memory data integration capabilities through their own solutions or partner offerings.
- Conformed: Achieving data conformity typically involves harmonizing all relevant data to common formats, vocabularies, schemas, dimensions, and hierarchies. In practice, this means providing query, reporting, dashboarding, and other analytics access through APIs to a semantic abstraction layer.
- Comprehensible: As the complexity of data grows, ontologies become a more important tool for ensuring that the unified data model is comprehensible. Ontologies -- and the related notions of glossaries and taxonomies -- are principally oriented toward data's analytical uses within and across disparate data-store implementations. Framed in Resource Description Framework (RDF) and other formats, ontologies are artifacts of analysis geared to semantic query and knowledge discovery. They provide views of the concepts, relations, and rules for a particular area of business information, irrespective of how that information may be stored as data.
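The idea that an ontology captures concepts, relations, and rules independently of storage can be illustrated with a handful of subject-predicate-object statements in the style of RDF. The sketch below is a simplification under stated assumptions: it uses plain Python tuples rather than an RDF library, every `ex:` name is hypothetical, and the only rule it applies is transitive `rdfs:subClassOf` reasoning.

```python
# Triples follow the RDF subject-predicate-object shape. Names with the
# "ex:" prefix are hypothetical and not drawn from any published vocabulary.
TRIPLES = {
    ("ex:Customer", "rdfs:subClassOf", "ex:Party"),
    ("ex:Supplier", "rdfs:subClassOf", "ex:Party"),
    ("ex:KeyAccount", "rdfs:subClassOf", "ex:Customer"),
    ("ex:acme", "rdf:type", "ex:KeyAccount"),
    ("ex:globex", "rdf:type", "ex:Supplier"),
    ("ex:acme", "ex:buysFrom", "ex:globex"),
}


def superclasses(cls, triples):
    """Return cls plus every class reachable through rdfs:subClassOf."""
    seen = {cls}
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == "rdfs:subClassOf" and o not in seen:
                seen.add(o)
                frontier.append(o)
    return seen


def instances_of(cls, triples):
    """Return individuals whose asserted type is cls or a subclass of it."""
    found = set()
    for s, p, o in triples:
        if p == "rdf:type" and cls in superclasses(o, triples):
            found.add(s)
    return found


parties = instances_of("ex:Party", TRIPLES)
customers = instances_of("ex:Customer", TRIPLES)
```

Nothing in the triples says whether `ex:acme` lives in a warehouse table, a document store, or a spreadsheet. The query for all parties works at the level of meaning, and the subclass rule surfaces `ex:acme` as a party even though that fact was never stated directly.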
In the broader perspective of advanced analytics, ontologies support the following use cases:
- Building semantic models: Developers explicitly model semantics as RDF ontologies and/or related logical structures such as taxonomies, thesauri, and topic maps. These ontologies are used to drive the creation of structured content that instantiates the entities, classes, relationships, attributes, and properties defined in the ontologies.
- Mediating between heterogeneous semantics: Developers use ontologies and other semantic models to drive the creation of mappings, transformations, and aggregations among existing, structured data sets.
- Mining the semantics implicit in unstructured formats: Developers use natural-language processing and pattern-recognition tools to extract the implicit semantics from unstructured text sources.
- Managing semantics in a consolidated repository: Application environments require repositories or libraries to manage ontologies and other semantic objects and maintain the rules, policies, service definitions, and other metadata to support the life-cycle management of application semantics.
- Governing semantics through comprehensive controls: Application environments require that various controls -- on access, change, versioning, auditing, and so forth -- be applied to ontologies; otherwise, it would be meaningless to refer to them as "controlled vocabularies."
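As a rough illustration of the mediation use case, the following Python sketch maps records from two sources onto one shared vocabulary. The source names, field names, and synonym table are all hypothetical; in a real deployment the mappings would more plausibly be derived from a governed ontology than hard-coded in a dictionary.

```python
# Hypothetical mapping from each source's local field names to shared terms.
FIELD_MAP = {
    "crm": {"cust_nm": "name", "rgn": "region", "rev_usd": "revenue_usd"},
    "erp": {
        "CustomerName": "name",
        "SalesRegion": "region",
        "Revenue": "revenue_usd",
    },
}

# Hypothetical value-level synonyms resolved to one controlled term.
REGION_SYNONYMS = {
    "emea": "EMEA",
    "europe": "EMEA",
    "apac": "APAC",
    "asia-pacific": "APAC",
}


def conform(source, record):
    """Rename source fields to shared terms and normalize region values."""
    mapping = FIELD_MAP[source]
    out = {}
    for field, value in record.items():
        term = mapping.get(field)
        if term is None:
            continue  # unmapped fields are simply dropped in this sketch
        out[term] = value
    if "region" in out:
        key = str(out["region"]).strip().lower()
        # Unknown regions pass through unchanged rather than being guessed.
        out["region"] = REGION_SYNONYMS.get(key, out["region"])
    return out


crm_row = conform(
    "crm",
    {"cust_nm": "Acme", "rgn": "Europe", "rev_usd": 1200, "legacy_flag": "Y"},
)
erp_row = conform(
    "erp", {"CustomerName": "Acme", "SalesRegion": "emea", "Revenue": 300}
)
```

Once both rows share the same terms and the same region value, they can be aggregated together. Keeping the mappings as data rather than code also makes it easier to apply the access, change, and versioning controls described above.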
You might regard ontologies as metadata applicable to the deep analytic meaning of data. As such, they provide a key semantic stratum within which all data-driven insights are firmly rooted.
About the Author
James Kobielus is senior director of research for data management at TDWI. He is a veteran industry analyst, consultant, author, speaker, and blogger in analytics and data management. At TDWI he focuses on data management, artificial intelligence, and cloud computing. Previously, Kobielus held positions at Futurum Research, SiliconANGLE Wikibon, Forrester Research, Current Analysis, and the Burton Group. He has also served as senior program director, product marketing for big data analytics for IBM, where he was both a subject matter expert and a strategist on thought leadership and content marketing programs targeted at the data science community. You can reach him by email ([email protected]), on Twitter (@jameskobielus), and on LinkedIn (https://www.linkedin.com/in/jameskobielus/).