Leveraging Tiered Data Analytics Approaches for Better Cloud Utilization
We explore how to leverage tiered data architecture when leveraging cloud-based platforms.
By David S. Linthicum
Most of those in the world of BI push back on the notion that cloud computing is a game-changer for data analytics; it's really more of a platform change. Still, the cloud offers highly scalable and more economical data storage and compute services, in both private and public instantiations.
The problems arise when it comes to traditional BI or even newer big data systems. Those building these systems don't understand the value that cloud computing may (or may not) bring. Moreover, data security issues are a bit scary, as is the need to deal with ever-changing compliance issues. Thus, enterprises seek more flexibility and choice when moving toward cloud computing.
An approach that has value is to leverage tiered data architecture when leveraging cloud-based platforms. We've been dealing with tiered data for years, specifically by leveraging different types of data storage based upon access frequency. For example, frequently accessed data would reside on higher-priced DASD; as it aged (and was accessed less often), the data would be moved to secondary storage (perhaps optical drives) and then off to cheaper tape storage. This approach was also called hierarchical storage management, among other names.
The tiered-data approach became largely impractical, as the cost of DASD-based mass storage systems fell to a point where optical drives and tape drives no longer made economic sense. However, the use of the same tiered approach to storing and managing analytical data within traditional systems, private clouds, and public clouds may, indeed, make sense for a few innovative enterprises.
The idea is that data storage systems do not need to exist on a single platform, such as traditional local data storage, within a private cloud, or on a public cloud. Instead, we can partition the data to reside on tiered physical or virtual servers that exist on traditional servers, within private clouds or virtualized infrastructure (there is a difference), or within a public cloud provider.
In some instances, you're placing the same data on each tier. For the most part, though, it's about partitioning different data, by entity, across the tiers. For instance, let's name our tiers:
- Tier 0: Local Database on Dedicated Server
- Tier 1: Database and/or Hadoop Distributed File System on a Private Cloud
- Tier 2: Database and/or Hadoop Distributed File System on a Public Cloud
In terms of analytical data, Tier 0 may contain data that has specific security and compliance issues that must be considered. For instance, Tier 0 could be used to store patient data with PII; it may be too risky to place such data on a private cloud with company-wide access or on a public cloud that is out of your direct control.
The tradeoff with placing data on Tier 0 is that there is no additional capacity on demand. The private cloud (Tier 1) offers some elasticity, though it is limited to the physical servers within that cloud, while the public cloud (Tier 2) provides huge data processing capacity on demand.
On Tier 1, you can place data that does not carry Tier 0's strict security or compliance requirements but is still not a good fit for public clouds. That poor fit may stem from residual security or compliance concerns, or from performance, considering the latency of sending large result sets over the open Internet when leveraging a public cloud.
Of course, Tier 2 provides the most expandability and flexibility. You can provision resources as you need them and thus scale up to a huge analytical processing load. For larger compute- and storage-intensive analytical processes, the public cloud will be a better fit.
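The placement logic described for the three tiers can be condensed into a simple policy function. Here is a minimal sketch in Python; the entity attributes (`contains_pii`, `compliance_restricted`) and all names are illustrative assumptions, not terms from the article:

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    """The three storage tiers described in the article."""
    LOCAL = 0    # Tier 0: local database on a dedicated server
    PRIVATE = 1  # Tier 1: private cloud / virtualized infrastructure
    PUBLIC = 2   # Tier 2: public cloud provider


@dataclass
class EntityProfile:
    """Hypothetical attributes driving placement (names are illustrative)."""
    name: str
    contains_pii: bool = False
    compliance_restricted: bool = False


def place(entity: EntityProfile) -> Tier:
    """Route an entity's data to a tier based on its risk profile."""
    if entity.contains_pii:
        return Tier.LOCAL    # too risky to leave your direct control
    if entity.compliance_restricted:
        return Tier.PRIVATE  # shareable internally, not a public-cloud fit
    return Tier.PUBLIC       # everything else gets on-demand elasticity
```

Centralizing the decision in one function also gives security and governance teams a single place to audit placement rules as compliance requirements change.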
The key to this approach is to bring the tiers together to form a complete analytical solution. Typically this means leveraging the data contained in each tier as sets of data services that provide a consistent mechanism for access. Moreover, these data services are typically built around an entity, such as the patient data services that we would like to remain on Tier 0. Treatment data services might then exist on Tier 1, while massive amounts of outcome-based data, fronted by data services, will be best suited to Tier 2.
Moving up, we then build the analytical services on top of the data services exposed from Tiers 0-2; these analytical services may leverage one, two, or three tiers in order to produce the proper analytics. Moreover, you can relocate the data as required in the future with only slight modifications to the analytical services.
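As a rough sketch of how entity-oriented data services and a cross-tier analytical service might fit together: the patient/treatment/outcome entities follow the article's example, while the class names, field names, and in-memory stores are stand-ins for real tier-resident databases or HDFS clusters:

```python
class DataService:
    """Consistent access mechanism, regardless of which tier hosts the data.
    The in-memory row list stands in for a tier-resident database or HDFS
    cluster behind the same query() interface."""

    def __init__(self, tier: int, rows: list[dict]):
        self.tier = tier
        self._rows = rows

    def query(self, **criteria) -> list[dict]:
        return [r for r in self._rows
                if all(r.get(k) == v for k, v in criteria.items())]


# Entity -> service registry. Relocating an entity's data later means
# repointing its entry here, with only slight changes to the analytics
# built on top.
services = {
    "patient":   DataService(0, [{"patient_id": 1, "pii": "redacted"}]),
    "treatment": DataService(1, [{"patient_id": 1, "treatment": "T-42"}]),
    "outcome":   DataService(2, [{"patient_id": 1, "outcome": "improved"}]),
}


def outcomes_by_treatment(patient_id: int) -> list[dict]:
    """Analytical service that composes data from Tier 1 and Tier 2."""
    return [{"treatment": t["treatment"], "outcome": o["outcome"]}
            for t in services["treatment"].query(patient_id=patient_id)
            for o in services["outcome"].query(patient_id=patient_id)]
```

Because the analytical service reaches data only through the registry, moving an entity's data to another tier means updating one mapping entry, which is consistent with the "slight modifications" claim above.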
Of course, it is important to consider both security and governance in the creation of this solution, given the complex and distributed nature of this architecture. Also, there is no hard and fast rule as to how many tiers you employ. Indeed, you can mix and match the architectural concepts provided in this article to meet the exact analytical requirements of your business.
The idea here is to provide the best possible flexibility and agility around your data analytics requirements through the use of several different platforms, which may service different types of data, and through the ability to place the right data on the right platform, including the use of private and public cloud computing.
The downside of this approach is complexity. Although there are databases that support the use of tiers, even tiers on cloud providers, this is largely something you'll have to create and maintain yourself. The upside is flexibility and expandability, which leads to additional agility around the application of core data analytics services. The ROI could be pretty quick if done correctly.
David S. Linthicum is a big data and cloud computing expert and consultant. He is the author or co-author of 13 books on computing, including Enterprise Application Integration (Addison Wesley). You can contact the author at www.davidlinthicum.com.