
Question and Answer: How to Improve Performance from Cloud Computing

How to scale the data tier to get better performance.

Cloud computing allows businesses to scale massively without investing in expensive new equipment. Yet cloud architectures will never reach their full potential as long as databases outside the cloud severely restrict scalability. For companies that want to retain their databases and benefit from the cloud's scaling potential, distributed caching is one of the best solutions.

In this interview, we talk with Jeff Hartley, vice president of marketing and products for startup Terracotta, about new ways to scale the data tier for better performance, as well as what should and shouldn't be stored in a cloud database for best performance. Hartley also discusses how caching and clustering technologies help address performance issues.

TDWI: How would you define the private cloud vs. the public cloud?

Jeff Hartley: Generally, a public cloud is a service accessed over the Internet that runs on hardware owned by a third party. Amazon EC2 is one example. In concept, you pay for it as you would pay for an electric utility. Low upfront costs are a nice benefit here.

A private cloud is one in which you operate your own hardware with infrastructure software that gives you some of the efficiencies of cloud computing, such as efficient hardware utilization for uncorrelated workloads. A private cloud, however, requires upfront capital spending to procure the hardware. Although that implies more upfront cost, it might be preferable in some cases, depending on the sensitivity of the data involved. Private clouds in some ways remind me of working on corporate mainframes: there is essentially a pool of computing power available, and you use a slice of it.

If we call Terracotta a private cloud provider, what kinds of issues are potential customers coming to you with?

I wouldn't say we're a private cloud provider per se, but we provide solutions that can help clouds scale better, whether they're public or private. In fact, our solutions apply to a wide variety of applications that need a seamless, reliable way to scale, whether they're running in a cloud or not. In the cloud context, what we provide is scalability of the data layer, so that the elastic compute layer enabled by modern virtualization can be efficiently fed the data it needs to consume. This is generally accomplished by allowing organizations to change the balance of their data management workload so that less data is handled in databases and more is handled in the application tier, using distributed caches and other forms of durable shared memory.
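
To make that shift concrete, here is a minimal cache-aside sketch in plain Java. It is illustrative only: the ProductCatalog class and its loadFromDatabase call are hypothetical stand-ins rather than Terracotta's API, and a local ConcurrentHashMap stands in for a distributed cache shared across application servers.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal cache-aside sketch: the application tier answers repeat reads
// from shared memory and only falls through to the database on a miss.
// In a real deployment the map would be a distributed cache shared by
// all application servers; here a ConcurrentHashMap stands in for it.
public class ProductCatalog {

    private final Map<String, String> cache = new ConcurrentHashMap<>();

    public String describe(String productId) {
        // Cache hit: no database round trip at all.
        return cache.computeIfAbsent(productId, this::loadFromDatabase);
    }

    // Hypothetical loader; in practice this would be a JDBC or ORM call.
    private String loadFromDatabase(String productId) {
        return "description for " + productId;
    }
}
```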

The scalability of the data tier in clouds is becoming a top-of-mind issue for CIOs as they assess their cloud strategies. It's clear that without some new approaches to managing application data, we might struggle to realize many of the economic benefits of cloud computing that we've come to expect. The database can become a bottleneck, and we're all familiar with some of the costs of scaling databases to meet increased demand.

Smart planners will think about what types of data their applications in the cloud need, as well as what data belongs in distributed caches, what belongs in the database, what belongs in clustered Web sessions, and so forth. For example, do you need to store in the database the fact that for the next five seconds a user is on page five of an eight-page Web workflow, just in case you need to recover the state of the application? Probably not. Should the order that the user places with a site be put into a database for reporting and customer service purposes in the future? Probably yes. The bottom line is this: It's just not efficient to store certain types of data in databases.

You've said that analysts are seeing the rate of inquiry around distributed caching platforms rise rapidly over the past few months. Why is that?

I think people are running into the scalability issues we just discussed and are looking for alternatives to provide more flexible scalability to their applications. The cost of scaling databases further encourages a search for alternatives. Also, from a development, maintenance, and performance perspective, avoiding unnecessary use of databases can be quite beneficial. Object-oriented data from the application tier doesn't need to be mapped into relational form for temporary storage in the database, so you can write applications more quickly. They're also easier to maintain and they can perform better because they don't need to hit the database as much.

How can a business determine when a certain type of data in a cloud is better off with caching and clustering technologies vs. in a database?

Generally, caching and clustering technologies are a complement to the database and not a replacement. If the data is read-only or read-mostly and it's not needed for long-term reporting and analysis, or if it has a temporary "in-flight data" flavor, businesses should consider maintaining that data in a distributed cache, clustered Web sessions, or some other form of durable shared memory.

If you are storing completed business transactions, if querying the data down the road is important, and if those queries might come from a number of different applications or ad hoc sources, then the database is the proper place for the data. In many cases, data will go through phases where maintaining it in caching and clustering technologies makes sense until it forms a complete business transaction, whereupon it's written to a database for long-term recordkeeping.
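
A minimal sketch of that phased lifecycle, assuming a hypothetical CheckoutService rather than any particular product API: in-flight cart data lives in shared application memory and only reaches the database once it becomes a completed order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of "in-flight" data living in shared application memory until it
// becomes a completed business transaction, at which point it is written
// to the database for long-term reporting.
public class CheckoutService {

    // Stand-in for a distributed cache of in-progress carts, keyed by session ID.
    private final Map<String, List<String>> cartsInFlight = new ConcurrentHashMap<>();

    public void addItem(String sessionId, String sku) {
        cartsInFlight.computeIfAbsent(sessionId, k -> new ArrayList<>()).add(sku);
    }

    public void completeOrder(String sessionId) {
        List<String> items = cartsInFlight.remove(sessionId);
        if (items != null) {
            persistOrder(sessionId, items); // only now does the database see the data
        }
    }

    // Hypothetical persistence call; in practice a JDBC or ORM insert.
    private void persistOrder(String sessionId, List<String> items) {
        System.out.println("Writing order for " + sessionId + ": " + items);
    }
}
```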

How does distributed caching work regarding transactional vs. analytical applications? Which type of application is distributed caching better for, and why?

Caching and clustering technologies are useful for both analytical and transactional applications. It really depends more on where in your application you can get performance and scalability benefits by reducing the number of hits to the database. A lot of applications out there can certainly benefit from reducing database load.

For an example, let's expand on the earlier point about data layer scalability issues in private clouds. In that example, we talked about a Web application use case where we wanted to reliably track a user's progress through a workflow, in case we needed to recover the state of the user/application conversation in the event of some disruption.

We wanted to know that the user was on page five of an eight-page workflow, so that if she clicks the "next" button and the application server she was dealing with before is no longer in service, a new application server can pick up the context of that conversation and nobody experiences any downtime, or worse, a situation where data needs to be re-entered. This sort of "conversational state" data, which might only have relevance for a few minutes or perhaps just a fraction of a second, is best handled by caching and clustering technologies. Such data exists in both analytical and transactional applications, and in both, pulling it out of the database can significantly improve performance and eliminate scalability bottlenecks.
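
A minimal sketch of that failover idea, assuming the workflow position is keyed by session ID in a store every application server can reach. The WorkflowTracker class is hypothetical, and a local map stands in for a clustered session store or distributed cache.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Conversational state kept outside any single application server: if the
// server handling page five goes away, another server looks up the same
// session ID and resumes the workflow where the user left off.
public class WorkflowTracker {

    // Hypothetical shared store; in production this would be a clustered
    // session store or distributed cache rather than a local map.
    private static final Map<String, Integer> CURRENT_PAGE = new ConcurrentHashMap<>();

    public static void recordPage(String sessionId, int page) {
        CURRENT_PAGE.put(sessionId, page);
    }

    public static int resume(String sessionId) {
        // A newly assigned server sees the same state and continues at page five.
        return CURRENT_PAGE.getOrDefault(sessionId, 1);
    }
}
```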

How should data managers decide which data to store in caching and clustering technologies and which not to?

Data that is read-only or read-mostly, and can therefore be cached, is a great candidate for a distributed cache built with caching and clustering technologies.

On the other hand, data that is needed for long-term analysis, reporting, or regulatory reasons should still go into a database. So sales orders, shipment data, financial transactions, anything you might want to run reports on next week, in six months, or even years down the road, should go into a database. However, it might make sense to handle much of this data in caching and clustering technologies until the point where it forms a completed business transaction.

How does Terracotta handle business intelligence applications?

From a BI perspective, reliably handling the conversational state we discussed earlier would be useful whenever a user of a BI system is working through a workflow: building reports, setting up a new user profile, or building an ETL process. In these cases, you definitely want to avoid rework should a problem occur, since rebuilding the workflows would irritate users, but you also want to provide this reliability without undue impact on the responsiveness and scalability of the application. Similar examples are data related to online user state, including: Is the user logged on? Is the user a member of a group that has permissions to schedule this task? Can the user run this report?

Aside from user state, caching "read-only" or "read-mostly" reference data and the metadata used to run reports is a great way to use caching and clustering technologies. In a BI context, this might mean using caching and clustering technologies to cache read-only dimensional data, such as states, sales regions, or fiscal reporting periods, that users need in order to "slice and dice" data. This might also apply to storing derived data used to boost performance, such as aggregates.
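
As a rough illustration, and assuming a hypothetical RegionDimensionCache rather than any particular product API, read-mostly dimensional data might be cached in the application tier with a time-to-live so report queries rarely touch the warehouse.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of caching read-mostly dimensional data (e.g., sales regions) so
// report queries can slice and dice without re-reading the dimension table.
// The cached copy is refreshed only after a time-to-live expires.
public class RegionDimensionCache {

    private static final Duration TTL = Duration.ofHours(1);

    private final AtomicReference<List<String>> regions = new AtomicReference<>();
    private volatile Instant loadedAt = Instant.EPOCH;

    public List<String> salesRegions() {
        if (regions.get() == null || loadedAt.plus(TTL).isBefore(Instant.now())) {
            regions.set(loadFromWarehouse()); // infrequent database hit
            loadedAt = Instant.now();
        }
        return regions.get();
    }

    // Hypothetical loader standing in for a query against the warehouse.
    private List<String> loadFromWarehouse() {
        return List.of("EMEA", "APAC", "Americas");
    }
}
```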
