Question and Answer: How to Improve Performance from Cloud Computing
How to scale the data tier to get better performance.
- By James E. Powell
- February 3, 2010
Cloud computing allows businesses to scale massively without investing in expensive new equipment. Yet cloud architectures will never reach their full potential as long as databases outside the cloud severely restrict scalability. For companies that want to retain their databases and benefit from the cloud's scaling potential, distributed caching is one of the best solutions.
In this interview, we talk with Jeff Hartley, vice president of marketing and products for startup Terracotta, about new ways to scale the data tier for better performance, as well as what should and shouldn't be stored in a cloud database for best performance. Hartley also discusses how caching and clustering technologies help address performance issues.
TDWI: How would you define the private cloud vs. the public cloud?
Jeff Hartley: Generally, a public cloud is a service accessed over the Internet that runs on hardware owned by a third party. Amazon EC2 is one example. In concept, you pay for it as you would pay for an electric utility, and the low upfront costs are a nice benefit.
A private cloud is one in which you operate your own hardware with infrastructure software that gives you some of the efficiencies of cloud computing, such as efficient hardware utilization for uncorrelated workloads. A private cloud, however, requires upfront capital spending to procure that hardware. Even with the higher upfront cost, it might be preferable in some cases, depending on the sensitivity of the data involved. Private clouds in some ways remind me of working on corporate mainframes: there is essentially a pool of computing power available, and you use a slice of it.
If we call Terracotta a private cloud provider, what kinds of issues are potential customers coming to you with?
I wouldn't say we're a private cloud provider per se, but we provide solutions that can help clouds scale better, whether they're public or private. In fact, our solutions apply to a wide variety of applications that need a seamless, reliable way to scale, whether they're running in a cloud or not. In the cloud context, what we provide is scalability of the data layer, so that the elastic compute layer enabled by modern virtualization can be efficiently fed the data it needs to consume. This is generally accomplished by allowing organizations to change the balance of their data management workload so that less data is handled in databases and more is handled in the application tier, using distributed caches and other forms of durable shared memory.
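To make that pattern concrete, here is a minimal cache-aside sketch in Java. The class and method names (ProductLookup, ProductDatabase, loadProduct) are hypothetical illustrations rather than Terracotta's API, and a local ConcurrentHashMap stands in for what would, in practice, be a distributed cache shared by every application server.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Cache-aside sketch: repeat reads are served from the application tier,
// and the database is consulted only on a cache miss.
public class ProductLookup {

    /** Hypothetical data-access interface wrapping the relational database. */
    public interface ProductDatabase {
        Product loadProduct(String sku);
    }

    /** Hypothetical domain object, cached as-is with no relational mapping. */
    public static class Product {
        public final String sku;
        public final String name;
        public Product(String sku, String name) { this.sku = sku; this.name = name; }
    }

    // Stand-in for a distributed cache shared across application servers.
    private final Map<String, Product> cache = new ConcurrentHashMap<>();
    private final ProductDatabase database;

    public ProductLookup(ProductDatabase database) {
        this.database = database;
    }

    public Product findProduct(String sku) {
        // Hit the database only when the product is not already cached.
        return cache.computeIfAbsent(sku, database::loadProduct);
    }
}
```

The point of the pattern is simply that repeat reads never leave the application tier; only the first request for a given product touches the database.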
The scalability of the data tier in clouds is becoming a top-of-mind issue for CIOs as they assess their cloud strategies. It's clear that without some new approaches to managing application data, we might struggle to realize many of the economic benefits of cloud computing that we've come to expect. The database can become a bottleneck, and we're all familiar with some of the costs of scaling databases to meet increased demand.
Smart planners will think about what types of data their applications in the cloud need, as well as what data belongs in distributed caches, what belongs in the database, what belongs in clustered web sessions, and so forth. For example, do you need to store in the database the fact that for the next five seconds a user is on page five of an eight-page Web workflow, just in case you need to recover the state of the application? Probably not. Should the order that the user places with a site be put into a database for reporting and customer service purposes in the future? Probably yes. The bottom line is this: It's just not efficient to store certain types of data in databases.
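A rough sketch of that decision, with hypothetical names throughout: the page the user is currently on lives only in the application tier's shared memory, while the completed order is written through to the database for later reporting and customer service.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Which store does each piece of data belong in? (Hypothetical names.)
public class CheckoutState {

    /** Hypothetical DAO for completed orders, backed by the relational database. */
    public interface OrderRepository {
        void save(String customerId, Order order);
    }

    /** Hypothetical completed business transaction. */
    public static class Order { /* order lines, totals, payment details ... */ }

    // Stand-in for a distributed cache or clustered session store; a real
    // distributed cache would also expire these entries with a time-to-live.
    private final Map<String, Integer> workflowPage = new ConcurrentHashMap<>();
    private final OrderRepository orders;

    public CheckoutState(OrderRepository orders) {
        this.orders = orders;
    }

    /** Transient, in-flight state: which page of the eight-page workflow the user is on. */
    public void rememberPage(String customerId, int page) {
        workflowPage.put(customerId, page);   // never touches the database
    }

    /** A completed business transaction: persist it for reporting and customer service. */
    public void placeOrder(String customerId, Order order) {
        orders.save(customerId, order);
        workflowPage.remove(customerId);      // the transient state has served its purpose
    }
}
```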
You've said that analysts have seen the rate of inquiry around distributed caching platforms rise rapidly over the past few months. Why is that?
I think people are running into the scalability issues we just discussed and are looking for alternatives to provide more flexible scalability to their applications. The cost of scaling databases further encourages a search for alternatives. Also, from a development, maintenance, and performance perspective, avoiding unnecessary use of databases can be quite beneficial. Object-oriented data from the application tier doesn't need to be mapped into relational form for temporary storage in the database, so you can write applications more quickly. They're also easier to maintain and they can perform better because they don't need to hit the database as much.
How can a business determine when a certain type of data in a cloud is better kept in caching and clustering technologies than in a database?
Generally, caching and clustering technologies are a complement to the database and not a replacement. If the data is read-only or read-mostly, and it's not needed for long-term reporting and analysis, or if it has a temporary "in flight data" flavor, businesses should consider maintaining that data in a distributed cache, clustered Web sessions, or some other form of durable shared memory.
If you are storing completed business transactions, if querying the data down the road is important, and if those queries might come from a number of different applications or ad hoc sources, then the database is the proper place for the data. In many cases, data will go through phases where maintaining it in caching and clustering technologies makes sense until it forms a complete business transaction, whereupon it's written to a database for long-term recordkeeping.
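One way to picture that phased lifecycle, again with hypothetical names: an in-flight shopping cart accumulates in the cache across many requests, and only when it becomes a completed order does it move to the database and leave the cache.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Phased-lifecycle sketch (hypothetical names): data stays in the cache while
// it is in flight and moves to the database once it forms a completed transaction.
public class CartLifecycle {

    /** Hypothetical DAO that writes completed orders to the database. */
    public interface OrderRepository {
        void save(String customerId, List<String> itemSkus);
    }

    // Stand-in for a distributed cache holding in-flight carts.
    private final Map<String, List<String>> openCarts = new ConcurrentHashMap<>();
    private final OrderRepository orders;

    public CartLifecycle(OrderRepository orders) {
        this.orders = orders;
    }

    /** Phase 1: the cart grows in the cache; the database is never touched. */
    public void addItem(String customerId, String sku) {
        openCarts.computeIfAbsent(customerId,
                id -> Collections.synchronizedList(new ArrayList<>())).add(sku);
    }

    /** Phase 2: the cart becomes a completed business transaction and moves to the database. */
    public void checkout(String customerId) {
        List<String> cart = openCarts.remove(customerId);
        if (cart != null) {
            orders.save(customerId, cart);   // long-term recordkeeping and reporting
        }
    }
}
```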
How does distributed caching work regarding transactional vs. analytical applications? Which type of application is distributed caching better for, and why?
Caching and clustering technologies are useful for both analytical and transactional applications. It really depends more on where in your application you can get performance and scalability benefits by reducing the number of hits to the database. A lot of applications out there can certainly benefit from reducing database load.
For example, let's expand on the earlier point about data layer scalability issues in private clouds. In that example, we talked about a Web application use case where we wanted to reliably track a user's progress through a workflow, in case we needed to recover the state of the user/application conversation in the event of some disruption.
We wanted to know that the user was on page five of an eight-page workflow, so that if she clicks the "next" button and the application server she was dealing with before is no longer in service, a new application server can pick up the context of that conversation and nobody experiences any downtime or, worse, a situation where data needs to be re-entered. This sort of "conversational state" data, which might only be relevant for a few minutes or perhaps just a fraction of a second, is best handled by caching and clustering technologies. Such data exists in both analytical and transactional applications, and in both cases, pulling it out of the database can significantly improve performance and eliminate scalability bottlenecks.
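With standard servlet session clustering, that conversational state is nothing more than a session attribute. The sketch below uses the plain javax.servlet API rather than anything Terracotta-specific; the failover behavior comes from the session being clustered, so an attribute written on one application server is visible to whichever server handles the next click.

```java
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpSession;

// Conversational-state sketch. With clustered web sessions, an attribute written
// on one application server is visible to whichever server handles the user's
// next request, so the workflow position survives a server failure.
public class WorkflowStepTracker {

    private static final String STEP_ATTRIBUTE = "checkout.currentPage";

    /** Record that the user has reached the given page of the workflow. */
    public void recordStep(HttpServletRequest request, int page) {
        HttpSession session = request.getSession(true);
        session.setAttribute(STEP_ATTRIBUTE, page);
    }

    /** Recover the user's position, possibly on a different server than before. */
    public int currentStep(HttpServletRequest request) {
        Object page = request.getSession(true).getAttribute(STEP_ATTRIBUTE);
        return (page instanceof Integer) ? (Integer) page : 1;   // default to the first page
    }
}
```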
How should data managers decide which data to store in caching and clustering technologies and which not to?
Data that is read-only or read-mostly is a great candidate for a distributed cache built with caching and clustering technologies.
On the other hand, data that is needed for long-term analysis, reporting, or regulatory reasons should still go into a database. So sales orders, shipment data, financial transactions, and anything else you might want to run reports on next week, in six months, or even years down the road should go into a database. However, it might make sense to handle much of this data in caching and clustering technologies until it forms a completed business transaction.
How does Terracotta handle business intelligence applications?
From a BI perspective, reliably handling the conversational state we discussed earlier would be useful whenever a user of a BI system is working through a workflow -- that is, when building reports, setting up a new user profile, or building an ETL process. In these cases, you definitely want to avoid re-work should a problem occur, since rebuilding the workflows would irritate users, but you also want to provide this reliability without undue impact on the responsiveness and scalability of the application. Similar examples are data related to online user state, including: Is the user logged on? Is the user a member of a group that has permissions to schedule this task? Can the user run this report?
Aside from user state, caching "read-only" or "read-mostly" reference data and the metadata used to run reports is a great way to use caching and clustering technologies. In a BI context, this might mean caching read-only dimensional data such as states, sales regions, or fiscal reporting periods that users need in order to "slice and dice" data. It might also apply to storing derived data used to boost performance, such as aggregates.
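As a final sketch, with hypothetical names and no particular BI product's API in mind, read-mostly dimension data such as sales regions can be loaded from the warehouse once, served from the cache for every slice-and-dice request, and refreshed on a schedule rather than queried per report.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Read-mostly dimension cache sketch for a BI application (hypothetical names).
public class DimensionCache {

    /** Hypothetical DAO that reads dimension tables from the warehouse database. */
    public interface DimensionDatabase {
        List<String> loadMembers(String dimensionName);   // e.g. "salesRegion" -> its members
    }

    private final Map<String, List<String>> dimensions = new ConcurrentHashMap<>();
    private final DimensionDatabase database;

    public DimensionCache(DimensionDatabase database) {
        this.database = database;
    }

    /** Slice-and-dice lookups are served from the cache; the database is read once per dimension. */
    public List<String> members(String dimensionName) {
        return dimensions.computeIfAbsent(dimensionName, database::loadMembers);
    }

    /** Dimensions change rarely, so refresh them on a schedule instead of per request. */
    public void refresh(String dimensionName) {
        dimensions.put(dimensionName, database.loadMembers(dimensionName));
    }
}
```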