Executive Perspective: Solving the Big Challenges of Big Data
As your enterprise data volumes grow, you face significant changes. Ted Dunning, HPE's Data Fabric chief technology officer, explains the expected -- and unexpected -- challenges you'll face and how your data management practices will change accordingly.
- By James E. Powell
- June 23, 2020
As your enterprise collects and analyzes more data, your data center will grow in expected ways -- adding more hardware in an ever-growing footprint. However, you'll also face a number of unexpected challenges, according to Ted Dunning, Data Fabric chief technology officer at HPE. Below, he describes how edge and core architectures are affecting your data center, the role of a data fabric, and the impact of AI and machine learning on data management.
Upside: As enterprises start to store petabytes of data (and many are close to crossing the exabyte mark), they're going to discover new challenges. What are some of the challenges they can expect?
Ted Dunning: The issue is that each power of ten in scale changes the problems you face significantly. A useful rule of thumb is that once you change scale by three powers of ten, you usually have problems that are qualitatively different than at the smaller scale. Many companies find this out the hard way.
Some of the problems are highly predictable. Companies will require 10x as many disk drives to store 10x as much data, for instance. That will consume 10x more power, 10x as many enclosures, and 10x as much floor space. Those kinds of problems that can be predicted by simple extrapolation from a system with 100 terabytes up to a system with 100 petabytes are not all that surprising.
Some things that people are surprised by are similarly simple but are still unanticipated. For instance, when your data size increases dramatically, it is common for the number of files (or tables or message stream topics) to increase by a similar amount. This happens either because the average size of each of these objects doesn't increase, or because the data growth comes from dealing with more physical objects in the world, each of which produces more of these data objects.
The problem comes when the teams implementing these systems realize that they may have committed to a technology base that is too limited to handle really large scale. If their architecture and growth commit them to needing a billion files, they may be fine, but if this increases to a trillion files, most data systems are going to simply fall over.
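As a rough illustration of that jump, the sketch below assumes a fixed average object size of 100 KB (a hypothetical figure, not one from the interview) and counts the objects needed at 100 terabytes and at 100 petabytes. Under that assumption the count goes from about a billion to about a trillion.

```python
TB = 10**12  # decimal terabyte, in bytes
PB = 10**15  # decimal petabyte, in bytes

# Assumed average object size; purely illustrative.
AVG_OBJECT_BYTES = 100 * 10**3  # 100 KB


def object_count(total_bytes: int, avg_object_bytes: int) -> int:
    """Number of objects needed to hold total_bytes at a fixed average size."""
    return total_bytes // avg_object_bytes


small = object_count(100 * TB, AVG_OBJECT_BYTES)
large = object_count(100 * PB, AVG_OBJECT_BYTES)

print(f"100 TB -> {small:,} objects")
print(f"100 PB -> {large:,} objects")
print(f"growth factor: {large // small}x")
```

If the average object size stays constant, the object count grows by the same factor of 1,000 as the data volume, whatever the assumed size is.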
It isn't uncommon for technology vendors to say "no limits" on some parameter such as file or topic count when they should say, "We haven't ever built a really large system and have no idea what will break." More than once I have helped pick up the pieces when one of our clients had to make an unexpected technology swap to get past scaling limits.
What are some of the challenges enterprises don't expect?
One of the biggest is that they don't understand the effect scaling their data systems will have on their teams, their development process, and their operational processes.
For instance, suppose that somebody decides that having a limited number of files isn't such a big problem because they can just stand up another cluster every time they get a few hundred million. The real problem here is that such a strategy implies a hidden tax on the development process and on the operational budget.
The development process suffers because there is a creeping complexity in figuring out how to address files in an ever-growing number of clusters. The operational impact is that the administrative cost of maintaining clustered systems is typically nearly proportional to the number of systems you are maintaining, not the size of each one. That means that as their data systems grow, they will be devoting a larger and larger share of their mind to simply dealing with that growth.
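To put a number on that hidden tax, the sketch below assumes a per-cluster cap of 300 million files (a hypothetical stand-in for "a few hundred million") and counts how many clusters a team would have to operate at a billion files and at a trillion files. If administrative cost is roughly proportional to cluster count, the cluster count is a fair proxy for that cost.

```python
# Assumed per-cluster file limit; illustrative only.
FILES_PER_CLUSTER = 300_000_000


def clusters_needed(total_files: int, files_per_cluster: int) -> int:
    """Smallest number of clusters that can hold total_files (integer ceiling)."""
    return (total_files + files_per_cluster - 1) // files_per_cluster


at_billion = clusters_needed(10**9, FILES_PER_CLUSTER)
at_trillion = clusters_needed(10**12, FILES_PER_CLUSTER)

print(f"1 billion files  -> {at_billion} clusters")
print(f"1 trillion files -> {at_trillion} clusters")
```

Under these assumptions the team goes from operating 4 clusters to operating 3,334, and every application has to know which of them holds a given file.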
The result is usually a crunch and it is usually a very bad crunch. Eventually, the technical debt imposed by using sleazy workarounds to deal with data growth causes technical bankruptcy and that can take an unfortunately dramatic form.
Another challenge enterprises commonly fail to anticipate is that data growth is often caused by collecting more kinds of data from more things/customers/factories/regional centers near the edge. That often implies a dramatic increase in the scope and complexity of something as simple as retrieving telemetry from all of these dispersed data systems and getting it into one place where some system can take a global view of what is working and what isn't.
A great example of this is the 5G rollout that is going on right now in lots of places all over the world. The 5G towers are much more advanced than the 4G towers in terms of controlling adaptive beamforming and many other tricks that dramatically increase the bandwidth consumers can use. As a side effect, these systems need much more complex monitoring with intelligence near the edge to make sense of it all as well as the ability to move that telemetry relatively transparently. We are seeing lots of people who are very surprised at how hard telemetry can be at scale and who need help building solutions.
A related problem is in security analytics. Here, the telemetry is much less focused on how things are working and much more focused on determining when somebody is trying to make them not work correctly (and thus do something that we would rather they didn't). Because hackers are typically trying to figure out some new attack, we can't really specify all the kinds of data we will need to have in order to catch them. Unlike operational telemetry, where we probably understand many (if not most) potential failure modes, we definitely can't pretend that we understand all the potential ways attackers will strike at our systems in the future. That makes the security telemetry problem much more challenging.
Finally, many teams fail to account for how scaling geographically is often just as important as scaling in terms of data volume. Because legacy systems are often designed to work entirely within the confines of a single data center, such geodistribution can be very hard to attain if you start without a good data fabric as the underlying foundation of your system.
What are the implications of edge and core architectures when it comes to the traditional data center?
One of the most important implications of edge and core architectures is that the idea of a single data center as a unitary entity is not really valid any more. The data center is a very efficient way to host a large amount of cross-connecting computations that share data and likely communicate extensively with each other. That is still a key mission.
However, it is important to remember that little business is conducted in a data center. The value of a business is generated at the interface with customers and suppliers. More and more, these interfaces generate data and require computation to occur near the point of that generation to meet latency and reliability requirements. Such local action, however, needs to be augmented with the recognition of patterns that exist across the entire business. Such patterns can only be dealt with centrally.
Thus, it is necessary to act locally but learn globally. Because of the rapidly increasing computational cost of such global learning, the data center is the natural place to do it.
Are there any specific technologies (such as AI or machine learning) that can help an enterprise manage this much data? If ML can help, don't such great data volumes mean enterprises have to use data samples rather than the full data set, which can negatively affect ML-generated algorithms?
AI and machine learning are often the rationale for building really large data sets in the first place. Without autonomous vehicle development's voracious appetite for real-world data (it is, at its core, a massive exercise in machine learning), carmakers would not need to record several gigabytes of data per second from each of hundreds or thousands of vehicles around the world -- resulting in hundreds of petabytes of data even after very selective retention.
On the other hand, machine learning can often allow substantial optimization of the hardware and software used to store these massive amounts of data. This can decrease the amount of data that needs to be retained by several orders of magnitude. In fact, it is just this optimization that explains why such autonomous vehicle development systems have to store only hundreds of petabytes. Without the optimization, they might well require dozens or hundreds of exabytes of data, which would make the current developments simply infeasible.
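A back-of-the-envelope calculation shows how these magnitudes fit together. Every input below is an assumption chosen for illustration (3 GB per second per vehicle, 8 recording hours a day, 250 days a year, 1,000 vehicles, and a 100x reduction from selective retention); none of these figures comes from the interview.

```python
GB = 10**9
PB = 10**15
EB = 10**18


def raw_bytes_per_year(bytes_per_second: int, hours_per_day: int,
                       days_per_year: int, vehicles: int) -> int:
    """Total bytes recorded per year across the fleet, before any filtering."""
    seconds_per_day = 3600 * hours_per_day
    return bytes_per_second * seconds_per_day * days_per_year * vehicles


# Assumed inputs; illustrative only.
raw = raw_bytes_per_year(3 * GB, hours_per_day=8, days_per_year=250, vehicles=1000)

# Assumed reduction factor from keeping only the interesting data.
REDUCTION_FACTOR = 100
retained = raw // REDUCTION_FACTOR

print(f"raw:      {raw / EB:.1f} EB per year")
print(f"retained: {retained / PB:.0f} PB per year")
```

With these inputs the raw stream is about 21.6 exabytes a year, and a 100x reduction brings it down to about 216 petabytes, consistent with the difference between "dozens of exabytes" and "hundreds of petabytes".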
The key is that much of the data being captured at the edge is really much the same as data that has been captured previously. Determining which data is boring isn't easy, however. As the machine learning systems under development increase in sophistication, they can be used to build systems that know interesting data when they see it. These can run near where the data originates so the most interesting bits are brought back to the core for more extensive use.
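The idea of keeping only interesting data at the edge can be sketched with a much simpler technique than the learned models described here. The example below uses a running z-score as a stand-in: a reading is kept only if it lies far from what has been seen so far. The class name, threshold, and warm-up length are all hypothetical, and a real system would likely use a trained model rather than this statistic.

```python
from math import sqrt


class NoveltyFilter:
    """Flag readings that look unusual relative to the readings seen so far."""

    def __init__(self, threshold: float = 3.0, warmup: int = 10):
        self.threshold = threshold  # z-score above which a reading is kept
        self.warmup = warmup        # readings to observe before filtering
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations (Welford)

    def is_interesting(self, x: float) -> bool:
        if self.n < self.warmup:
            interesting = True      # keep everything until statistics settle
        else:
            std = sqrt(self.m2 / (self.n - 1))
            if std == 0.0:
                interesting = x != self.mean
            else:
                interesting = abs(x - self.mean) / std > self.threshold

        # Update the running mean and variance with the new reading.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return interesting


edge_filter = NoveltyFilter()
for i in range(20):
    edge_filter.is_interesting(9.9 if i % 2 == 0 else 10.1)  # routine readings

routine = edge_filter.is_interesting(10.0)
unusual = edge_filter.is_interesting(50.0)
print(routine, unusual)
```

Only the readings flagged as interesting would be sent back to the core; the routine ones could be summarized or dropped at the edge.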
Other forms of data management can benefit from machine learning as well. For instance, really large data systems typically have dozens, hundreds, or even thousands of developers, data engineers, and data scientists working on them. As the number increases above just a few, it is common for them to begin to repeat basic tasks and analyses. For an example so basic as to be silly, multiple data scientists building marketing models might all incorporate a feature into their models that embodies the idea that a potential customer appears to be younger (or older) than is typical in the place they live. That might signal some interesting propensity to behave in different ways. Having all these people develop the same or nearly the same feature is a waste of effort, but if there are hundreds of data engineers or data scientists (or even more) building features, it becomes impossible to keep track of all of the bright ideas the team has.
Machine learning, however, can be used to suggest such commonly useful features or analyses to data scientists as they work based on the shape of what they seem to be building. To the extent that such suggestions are helpful and are incorporated into models, that can be used to signal further recommendations. Moreover, such a recommendation engine for data science would have the desirable effect of increasing the visibility into what kinds of data and computations have the most value to the company.
What is the role of a data fabric in handling large amounts of data?
The core role of a data fabric is to enable our clients to perform the right computation resulting in the right business action at the right time, in the right place, and with the right data. Moreover, this has to be done while essentially standing in the shadows, without making the core mission of doing the right computation and taking the right action any harder. This means that a data fabric needs, if it is to succeed in this supporting role, to orchestrate data motion, security, durability, and access in a way that is as simple as possible.
One aspect of that simplicity is that the data fabric must allow concerns to be separated. At the point of data generation, for instance, whoever is generating that data shouldn't need to think about how or where or when that data will be consumed. Conversely, at the point of consumption, an analytics program should be able to see all of the necessary data from anywhere without additional complexity to deal with data motion. In between, the person responsible administratively for making any necessary data motion happen should not care about the content of the data.
Such a separation of concerns allows each of the people working on these separate aspects of the overall problem to focus on succeeding on a particular part of the problem and not worry about other parts.
Another role of the data fabric is to foster multitenancy. At large scales, it is critical that many applications be able to cooperatively work on shared large-scale data structures. One reason is cost. At scale, we simply can't afford to give everybody their own enormous system. There are much less obvious reasons, however. For instance, the first application on a typical large-scale data system is often a stolid one that has a low chance of failure paired with little chance of dramatic and unexpectedly large value.
On the other hand, with strong multitenancy, subsequent projects can often come in on the coattails of the first few projects and be much more speculative in nature. That means that they can have a high probability of failure (because the major costs were borne by the first few projects), leading to a much higher chance of unexpectedly positive results. Such unexpected results often come from the unanticipated combination of large data assets that were originally built for different projects. Without good multitenancy and the ability to share data sets without compromising existing service-level agreements, such cross-pollination could never occur.
What are the top three best practices you recommend as enterprises see their data storage needs grow?
To survive the scale of data they will face in the near and mid-term future, enterprises absolutely must:
- Future-proof their data handling by adopting incremental data modeling techniques
- Level up from the idea of provisioning storage for VMs to the idea of providing a scalable data fabric for both VM-based applications and container-based applications
- Use multitenancy and geodistribution to maximize use of data wherever and whenever it is found
Describe HPE's product/solution and the problem it solves for enterprises.
HPE Container Platform and HPE Data Fabric software solutions help enterprises address the challenges of data-intensive applications at scale. HPE Container Platform makes it easier to develop and deploy containerized applications using Kubernetes, including data-intensive applications for AI/ML and analytics use cases. HPE Data Fabric allows you to store and manage data from small to large, in one location or in a geodistributed fashion. Together, they let you do the right computation with the right data, in the right place, at the right time -- whether in your data center, in any public cloud, or at the edge.
[Editor's note: Ted Dunning is chief technology officer for Data Fabric at Hewlett Packard Enterprise. He has a Ph.D. in computer science and is an author of over 10 books focused on data sciences. He has over 25 patents in advanced computing and plays the mandolin and guitar, both poorly. You can contact the author at firstname.lastname@example.org.]