TDWI Upside - Where Data Means Business

Technology Swaps: Why You Must Unlearn What You Think You Know

Are you considering swapping one technology or solution for another? That may be a wise move, but be sure you aren’t basing your decision on dangerous (and mistaken) assumptions.

The IT market is facing a big change in technologies used to process, store, and manage data. It seems like everybody is talking about using new databases and platforms for different tasks. Terms such as “polyglot persistence” are being bandied about to describe the “best-of-breed” approach where many different databases are used to manage data across the organization.

Many of the vendors would like you to believe that changing one technology for another is easy. The reality can be far different because there are tradeoffs made in adopting these different technologies, tradeoffs that may not be apparent at the outset.

Swapping technology A for technology B rests on the assumption that you're dealing with (1) a technology problem and (2) a technology that can be substituted, like for like, possibly conferring additional benefits in the process. Sometimes this works, as when organizations swapped out their old hierarchical and network databases for relational databases. Sometimes it doesn’t, as when organizations tried to trade relational for object databases (a technology that rose and crashed in the mid-90s).

Tech swapping is the right response when you have a problem and are working with a similar type of technology -- for example, swapping a SQL Server RDBMS for an Oracle RDBMS. It's also the right decision when you're making a class change -- for example, you're swapping out an overmatched, traditional database for a massively parallel processing (MPP) system. In this case, as with the prior like-to-like example, the basic principles are the same: you're just shifting from a general-purpose, non-MPP relational database to a parallel relational database that's optimized for query-processing performance.

Tech swapping can also work if you're swapping in a better-suited but totally different type of technology. Imagine swapping out the costly Oracle database that's powering your under-performing website for a NoSQL database such as Cassandra. Imagine a similar swap, albeit one that involves replacing one RDBMS (Oracle) with another -- namely, a sharded MySQL database. (“Sharding” MySQL involves breaking up or distributing the data in one database into multiple databases across multiple computers. The term “sharding” comes from the pieces of glass, or shards, from breaking a mirror, a play on words to do with mirroring databases for read performance.)
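To make the sharding idea concrete, here is a minimal sketch in Python of hash-based routing. It is an illustration only, not how any particular MySQL deployment works: the shard names, the `shard_for` and `route` helpers, and the choice of a hash-modulo scheme are all assumptions made for this example.

```python
import hashlib
from typing import List

# Hypothetical shard names standing in for separate MySQL servers.
SHARDS: List[str] = ["mysql-shard-0", "mysql-shard-1", "mysql-shard-2", "mysql-shard-3"]


def shard_for(key: str, num_shards: int) -> int:
    """Map a row key to a shard index using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def route(key: str) -> str:
    """Return the (hypothetical) server that owns the row for this key."""
    return SHARDS[shard_for(key, len(SHARDS))]


def shards_for_scan() -> List[str]:
    """A query that does not filter on the shard key has to visit every shard."""
    return list(SHARDS)
```

A lookup by key touches exactly one server, which is where the scaling benefit comes from. Anything that does not filter on the shard key (a report, a join across customers) fans out to every server, and that is the kind of tradeoff that is easy to miss when the swap is framed as one RDBMS for another.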

Here there be dragons. When you make this change, you run afoul of the things you don’t know that you don’t know. You first discover that there are, in fact, things that you don't know that you don't know. Second, you learn that some of what you “know” for one technology type won’t help you with the new technology. In fact, your intuition developed from years of experience may tell you the exact opposite of what you need to do. In other words, you must relearn and, more important, unlearn.

For example, if you are experienced at data modeling in an RDBMS world, your ideas about how to organize data for performance or to make change easier will be very different from what is needed in most NoSQL databases. Best practices in building data models for SQL Server or Teradata are not all that different from each other, but they can be bad practices in a different type of database such as Cassandra.
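A small Python sketch can illustrate why the instincts differ. It uses plain dictionaries, not a real database API, and the customer and order records, table names, and helper functions are all invented for this example. The first half stores each fact once and joins at read time, as relational practice encourages. The second half precomputes one structure per query, keyed by the value you will look up, which is the query-first, denormalized style commonly recommended for databases such as Cassandra.

```python
# Normalized, relational style: each fact is stored once and joined at read time.
customers = {1: {"name": "Ada"}, 2: {"name": "Grace"}}
orders = [
    {"order_id": 100, "customer_id": 1, "total": 25.0},
    {"order_id": 101, "customer_id": 1, "total": 40.0},
    {"order_id": 102, "customer_id": 2, "total": 15.0},
]


def orders_with_names_normalized(customer_id):
    """Join orders to the customer name when the query runs."""
    name = customers[customer_id]["name"]
    return [
        {"order_id": o["order_id"], "customer_name": name, "total": o["total"]}
        for o in orders
        if o["customer_id"] == customer_id
    ]


def build_orders_by_customer():
    """Query-first style: copy the customer name into every order row up front."""
    table = {}
    for o in orders:
        row = {
            "order_id": o["order_id"],
            "customer_name": customers[o["customer_id"]]["name"],
            "total": o["total"],
        }
        table.setdefault(o["customer_id"], []).append(row)
    return table


orders_by_customer = build_orders_by_customer()


def rename_customer_denormalized(table, customer_id, new_name):
    """Renaming now rewrites every copied row; returns how many rows changed."""
    rows = table.get(customer_id, [])
    for row in rows:
        row["customer_name"] = new_name
    return len(rows)
```

Both approaches answer the same question, but the second trades cheap reads for duplicated data and more expensive updates. In the normalized version a rename is one write; in the denormalized version it is one write per copy. A modeler trained to eliminate redundancy will instinctively resist the second design, which is the kind of unlearning described above.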

Changing from one type of technology to another has deeper and broader repercussions. Swapping in new, dissimilar technology affects more than simple technical interfaces. There are different development techniques and different management practices. Fundamentally, change of this kind affects the architecture of your systems.

The most common mistake people make with technology procurement is failing to recognize when they're contemplating a change that will affect the architecture of the system they are managing. Exhibit A is when an organization decides to replace a database with Hadoop.

This can be a good idea, as when you need to support analytics model building and execution. This workload usually means there are a smaller number of users, but those users may read -- and more important, write back -- enormous volumes of data. The algorithms they use are often iterative: reading, calculating, and then re-reading and re-calculating all that data. Contrast that with the workload of a business intelligence system. These systems tend to have more users, reading but never writing data, in a single pass with no iteration. This workload is what parallel relational databases were designed (some might say perfected) to run.

Moving a BI workload to Hadoop is fraught with difficulties because the design and management techniques you have learned in the relational world do not always apply. The components don’t work the same way, and the dependencies between tools change, sometimes in obvious ways and sometimes in hidden ways that are only uncovered at the most inconvenient time.

You should approach a project that is framed as a simple substitution of one product for another with caution. Learn how the new technology works and what the underlying differences from your existing technology mean. I’ve been using databases as examples, but this applies to any sort of technology.

Identify and list the tradeoffs that each of your choices makes. In the process, you may uncover things you didn’t know about what you already have. What tradeoffs are good or bad for your use? Cross-reference these and be sure to look at the secondary impacts.

For example, a flexible schema (or schema-on-read) is great for some purposes, but the tradeoff it makes is to move the enforcement of data conformance and quality to the application. This has far-reaching implications for the architecture of the application and any downstream system that might use its data. Because the data is not guaranteed to be in the correct fields, with the correct data types and without problems (such as missing values), it is the responsibility of any consuming application to address those problems when the data is read. Sometimes this is important, as with BI systems, and sometimes it isn’t, as with analytics model building (because that involves data preparation unique to each model and choice of data).
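As a rough illustration of what that responsibility looks like in practice, the Python sketch below shows a consumer checking records as it reads them. The field names, the `EXPECTED` mapping, and the `read_with_checks` function are hypothetical, and the type coercion is deliberately simple; a real consumer would need rules specific to its own data.

```python
# Hypothetical expectations for this consumer: field name -> type to coerce to.
EXPECTED = {"customer_id": int, "amount": float}


def read_with_checks(records):
    """Split raw records into cleaned rows and rejected rows with reasons.

    With schema-on-read the store accepted whatever was written, so
    missing fields and wrong types only surface here, at read time.
    """
    good, rejected = [], []
    for rec in records:
        problems = []
        clean = {}
        for field, typ in EXPECTED.items():
            value = rec.get(field)
            if value is None:
                problems.append(f"missing {field}")
                continue
            try:
                clean[field] = typ(value)
            except (TypeError, ValueError):
                problems.append(f"bad {field}: {value!r}")
        if problems:
            rejected.append((rec, problems))
        else:
            good.append(clean)
    return good, rejected
```

Every application that reads the data needs some version of this logic, and each one can make different choices about what counts as acceptable. In a schema-on-write database those checks happen once, when the data is loaded.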

As with any design decision, the trick is to start with the goal and the problem(s) you’re trying to solve. If your problem is performance, you may have a simple technology problem. In this case, tech swapping -- swapping in Tech B, a new and unknown thing of a different type, for Tech A, your existing solution -- could be a mistake. I don't mean to sound like a grouch, but people do this all of the time. They see that Tech A is slow, unresponsive, and can't support high levels of concurrency. They see that Tech B is used for big things by big serious companies, can scale to huge numbers of nodes, and is said to support high concurrency. Therefore, they conclude, B should replace A.

What they don't consider is that their issues may be a function of poor system design or under-provisioned resources, still the two most common sources of performance problems in the BI market.

When someone gives you a recommendation to try a new technology, look at it carefully. (Look at it with especial care if the recommendation came from a senior executive or one of your internal application developers.) In a well-understood market with the same types and classes of technologies, this should be a relatively easy decision.

When it involves a recommendation for a technology of a similar type but different class, the decision is harder. A good example of this is a database optimized for an OLTP workload versus one optimized for BI workloads. When it's a recommendation for a technology of a completely different type, you need to really focus on the tradeoffs it makes and what problems it has chosen to favor over others. This has implications for the other areas of your architecture.

Pay special attention to the second hardest problem: that you don’t know what you don’t know. The hardest problem? As the saying often attributed to Mark Twain goes, “It ain't what you don't know that gets you into trouble. It's what you know for sure that just ain't so.”

About the Author

Mark Madsen is the global head of architecture for Think Big Analytics where he is responsible for the consultants who help companies plan and build large-scale analytics infrastructure. Prior to this Mark was president of Third Nature, where he advised companies on data strategy and technology for data science and analytics.

Mark spent most of the past 25 years working in the analytics field, starting with AI at the University of Pittsburgh and autonomous robotics at Carnegie Mellon University. He is also involved with emerging technology as a researcher, sits on the O’Reilly Strata conference committee, chairs the Accelerate data science conference, is on the faculty of TDWI, and is a member of the Data Engineering and Science Council.
