The Big Data Pendulum Swings from Centralized to Distributed Workloads
As businesses try to improve how they leverage data and increase their competitive agility, a converged data platform will be key to their success.
- By Jack Norris
- March 1, 2016
Tech cycles have swung back and forth over the years, from centralized to distributed workloads. When it comes to big data, organizations typically focus their efforts on centralized data lakes, whose benefits stand in stark contrast to the alternative: operating separate data silos.
Of course, the benefits of centralization are numerous, including reduced data duplication, simplified management, and robust applications that benefit from disparate data feeds. Despite those benefits, large organizations will increasingly move to distributed processing for big data in 2016 in order to properly manage data across devices, data centers, and global use cases.
Let's take a look at some of the challenges and issues that organizations encounter with centralized workloads. Disparate workloads, for example, can prompt the need for separate processing clusters. Database applications are typically run on a separate cluster from Hadoop to avoid conflicts and to simplify management. Organizations looking to take advantage of streaming analytics with open source technologies such as Spark or Storm will need to deploy additional clusters to handle streaming data, coordinating separate feeds to Spark (for streaming analytics) and Hadoop (for batch analysis).
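The dual-path pattern described above -- one feed for low-latency streaming analytics, another for complete batch analysis -- can be sketched without any particular framework. The class names and click-count events below are illustrative, not any product's API; a minimal Python sketch of coordinating one data feed to both paths:

```python
from collections import deque

class StreamingPath:
    """Low-latency, approximate view: totals over only recent events."""
    def __init__(self, window=3):
        self.window = deque(maxlen=window)

    def ingest(self, event):
        self.window.append(event)

    def recent_total(self):
        return sum(e["clicks"] for e in self.window)

class BatchPath:
    """High-latency, complete view: recomputed over the full history."""
    def __init__(self):
        self.history = []

    def ingest(self, event):
        self.history.append(event)

    def recompute_total(self):
        return sum(e["clicks"] for e in self.history)

def feed(event, *paths):
    """Coordinate a single data feed to every processing path."""
    for p in paths:
        p.ingest(event)

stream, batch = StreamingPath(window=3), BatchPath()
for clicks in [5, 2, 8, 1]:
    feed({"clicks": clicks}, stream, batch)

print(stream.recent_total())    # last 3 events: 2 + 8 + 1 = 11
print(batch.recompute_total())  # full history: 5 + 2 + 8 + 1 = 16
```

In a two-cluster deployment, each path runs on its own infrastructure and the coordination logic in `feed` becomes its own moving part to operate and monitor -- which is the overhead the article is pointing at.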
In addition to workload obstacles, disparate users and groups can dictate the separation of clusters. As data access permissions and concerns regarding data privacy and protection mount, organizations are often forced to deploy separate physical clusters unless their platform has multi-tenancy capabilities that provide the necessary privacy and logical data separation.
Data gravity is another key driver. Many companies distribute processing workloads across multiple data centers in separate geographic locations. In addition to speed, the need for local processing is often driven by government regulations such as safe harbor privacy provisions. These provisions drive companies to separate the storage and processing of user data and to clearly define the acceptable borders for the processing of that data.
Emerging technology trends will further push the need for distributed processing. According to a recent study by Cisco, the Internet of Things (IoT) will comprise over 50 billion connected devices by 2020. The data emitted by these devices needs to be collected, processed, and analyzed. Best practices for IoT will consist of distributed local processing, selective filtering, and aggregation of data before it is transmitted to various locations.
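The filter-then-aggregate pattern is straightforward to illustrate. In this sketch, the sensor format, the temperature thresholds, and the summary fields are all invented for illustration; the point is that a device processes locally and transmits only a small aggregate:

```python
def edge_summarize(readings, min_temp=0.0, max_temp=120.0):
    """Selectively filter out-of-range readings on the device,
    then aggregate, so only a small summary crosses the network."""
    valid = [r for r in readings if min_temp <= r <= max_temp]
    if not valid:
        return None
    return {
        "count": len(valid),
        "mean": sum(valid) / len(valid),
        "max": max(valid),
    }

# Thousands of raw readings collapse to one three-field summary.
raw = [20.0, 21.5, -40.0, 22.0, 999.9]  # two sensor glitches
summary = edge_summarize(raw)
print(summary)  # count 3, mean ~21.17, max 22.0
```

Only `summary` would be transmitted upstream; the raw readings (including the glitches) never leave the device, which is what makes 50 billion devices tractable.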
The Emergence of a Converged Architecture Approach
Distributed processing of big data will increasingly be required, but organizations will not need to make an "either/or" decision when it comes to centralization. As the pendulum swings to distributed processing, a centralized and converged architecture approach will rise in prominence. Although seemingly an oxymoron, a distributed converged architecture is a natural outgrowth of big data evolution. As big data usage evolves and scales, local processing requirements increase. As more applications are deployed, different workloads can conflict and impact performance and job completion times.
A converged data approach addresses the challenges of evolving big data and application deployments. First, a converged platform manages disparate workloads without impacting performance. Second, a converged platform provides full multi-tenancy, with logical separation of data, job execution, and end-user access. Finally, to accommodate execution across remote sites, a converged data platform supports distributed processing with a logically centralized architecture.
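One way to picture the multi-tenancy point -- logical separation of data and end-user access on shared infrastructure -- is a tenant-scoped access check. This is a toy sketch with invented names; real platforms enforce the same boundary at the filesystem, volume, and job-scheduler level:

```python
class ConvergedStore:
    """One physical store, logically partitioned by tenant."""
    def __init__(self):
        self._data = {}     # (tenant, key) -> value
        self._members = {}  # user -> tenant

    def add_user(self, user, tenant):
        self._members[user] = tenant

    def put(self, user, key, value):
        self._data[(self._members[user], key)] = value

    def get(self, user, key):
        tenant = self._members[user]
        if (tenant, key) not in self._data:
            raise PermissionError(f"{key!r} not visible to tenant {tenant!r}")
        return self._data[(tenant, key)]

store = ConvergedStore()
store.add_user("alice", "ad-ops")
store.add_user("bob", "finance")
store.put("alice", "campaigns", ["q1-launch"])
print(store.get("alice", "campaigns"))  # ['q1-launch']
# store.get("bob", "campaigns") raises PermissionError:
# same cluster, but bob's tenant cannot see ad-ops data.
```

With this kind of logical separation, both tenants share one cluster instead of forcing the organization into separate physical deployments.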
What exactly does this mean? Consider a global automated advertising platform provider that runs an auction exchange for real-time bidding, where advertisers buy and sell online ad impressions. The company operates six data centers distributed across the world to deliver the performance that regional ad auctions require. Information, however, needs to be shared across all locations for management purposes and to better serve customers: global customers need to understand how ads perform in individual regions as well as how campaigns perform globally. In other words, information needs to be logically centralized to provide transparency and visibility, so customers can compare costs across regions and quickly adjust spending and shift priorities based on results.
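The ad exchange's requirement -- regional detail plus a logically centralized global view -- amounts to rolling per-site summaries into one report without moving the raw auction logs. The data center names and figures below are invented for illustration:

```python
# Each data center reports only a small per-region summary;
# the raw bid and auction logs stay local to that site.
regional = {
    "us-east":  {"impressions": 120_000, "spend": 3_400.0},
    "eu-west":  {"impressions":  80_000, "spend": 2_100.0},
    "ap-south": {"impressions":  50_000, "spend":   900.0},
}

def global_view(regions):
    """Logically centralize: roll regional summaries into one report."""
    total_impr = sum(r["impressions"] for r in regions.values())
    total_spend = sum(r["spend"] for r in regions.values())
    return {
        "impressions": total_impr,
        "spend": total_spend,
        # cost per thousand impressions, comparable across regions
        "cpm": 1000 * total_spend / total_impr,
    }

report = global_view(regional)
print(report["impressions"])   # 250000
print(round(report["cpm"], 2))  # 25.6
```

A customer can still drill into `regional` for per-site performance while `global_view` gives the campaign-wide picture; the heavy data never has to cross regional or regulatory borders.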
Logical centralization also extends to the administration and control of a distributed cluster. How data is accessed, protected, and managed needs to be centralized. Features such as wide area replication are required to ensure that data in each remote location has full disaster recovery protection. Wide area replication can also provide synchronization across sites to support real-time reporting and dashboards to manage business results and processing.
A converged data platform includes core features that address the big data challenges organizations are facing today; it integrates file storage, database, stream processing, and analytics, and delivers these benefits across a diverse set of applications and workloads.
As tech cycles continue to swing, successful organizations will reap the benefits of centralized big data while addressing the challenges of diverse and distributed workloads, users, locations, and regulations. As businesses continue to look for ways to improve their ability to leverage data and increase their competitive agility, a converged data platform -- which gives them the ability to process data locally and leverage it globally -- will be the foundation for success.
Jack Norris, chief marketing officer at MapR Technologies, has over 20 years of enterprise software marketing experience. He has demonstrated success from defining new markets for small companies to increasing sales of new products for large public companies. Jack's broad experience includes launching and establishing analytic, virtualization, and storage companies and leading marketing and business development for an early-stage cloud storage software provider. Jack has also held senior executive roles with EMC, Rainfinity (now EMC), Brio Technology, SQRIBE, and Bain & Company. Jack earned an MBA from UCLA Anderson and a BA in Economics with honors and distinction from Stanford University.