TDWI Articles

The Evolution of Data Lakes and Data Platforms: Migrating to the Cloud

Moving to a cloud-based data platform is a tall task, especially with so much data at stake. These tips can help.

Data lakes have quickly become mainstream as businesses realize how they help solve scalability and duplication issues while significantly enhancing analytics insights to gain a competitive advantage. In addition, their diversity of usage and inherent ability to house various types of data make data lakes critical for digital business. However, the data platforms and data management practices associated with data lakes are rapidly evolving.

Data platforms such as on-premises Hadoop are commonly used for data lakes, but a shift is underway. Organizations are starting to transition to cloud-based data platforms such as Amazon Web Services (AWS), Microsoft Azure, Snowflake, Google BigQuery, and Intellicloud to meet modern data lake requirements. These platforms are easier to use, more affordable, and more flexible than on-premises Hadoop.

Confronting Modern Data Lake Needs

Data lakes come in a variety of shapes and sizes. Many businesses have chosen to house their data lakes on the Hadoop Distributed File System (HDFS), which stores large files across the Hadoop cluster. Still, some have multiple data lakes outside of HDFS and need to process data in other storage systems, such as Amazon Simple Storage Service (S3), for different data initiatives.

As organizations look to leverage additional data platforms and shift from viewing data as a static resource to treating it as data in motion, businesses will continue to phase out their use of the Hadoop platform in favor of alternatives with broader applicability and flexibility. Cloud-based data platform use cases stretch beyond just a searchable distributed file system for unstructured data, a structure for batch data transformation, and a system for extracting value from data volume, variety, and velocity.

Reaping the Rewards of Cloud-Based Data Platforms

Cloud-based data platforms reduce costs and scale elastically compared to the Hadoop data platform on premises (which stores data in one HDFS for availability). The Hadoop data platform also requires data users to process and add value to their data sets manually, whereas cloud-hosted platforms allow users to add value to data sets automatically with features such as de-duplication, data quality scores, and a standardized enterprise data exchange. In addition, modern cloud systems allow businesses to collect and store data from multiple sources, organized for distribution, sharing, and advanced analytics.
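To make two of these features concrete, here is a minimal Python sketch of de-duplication and a simple data quality score. It is an illustration only, not any vendor's implementation: the function names, the sample customer records, and the choice of "completeness" (the share of required fields that are populated) as the quality measure are all assumptions made for this example.

```python
from __future__ import annotations

from typing import Iterable


def deduplicate(records: Iterable[dict], key_fields: tuple[str, ...]) -> list[dict]:
    """Keep the first record seen for each combination of key fields."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record.get(f) for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def completeness_score(records: list[dict], required_fields: tuple[str, ...]) -> float:
    """Fraction of required fields that are populated across all records."""
    total = len(records) * len(required_fields)
    if total == 0:
        return 0.0
    filled = sum(
        1 for r in records for f in required_fields if r.get(f) not in (None, "")
    )
    return filled / total


# Hypothetical sample data: one exact duplicate and one missing email.
customers = [
    {"customer_id": 1, "email": "a@example.com", "country": "US"},
    {"customer_id": 1, "email": "a@example.com", "country": "US"},
    {"customer_id": 2, "email": "", "country": "DE"},
]

unique_customers = deduplicate(customers, ("customer_id",))
score = completeness_score(unique_customers, ("email", "country"))
print(len(unique_customers), score)
```

A real platform would likely score other dimensions as well, such as validity or timeliness, but the principle is the same: the score is computed automatically as data lands rather than by hand.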

Cloud-based data platforms represent a consolidated system for data within an enterprise, minimizing data requests from business users to IT.

Emphasizing Data Governance for Cloud-Based Data Platforms

Cloud-based data platforms offer desirable benefits for an enterprise, including a sharable data platform that minimizes data requests for IT from business users, enables user self-service, and provides a good home for big data and other new data assets. However, switching to a cloud-based data platform isn’t a simple process. It requires comprehensive data governance across multiple systems and repositories, so that organizations can create a complete view of their data landscape, allowing users to define and easily understand data and associated business terms, quickly track data lineage, and efficiently manage all aspects of their data assets.

When an organization transitions to a cloud-based data platform, data governance is a must. While migrating, the organization needs to define what all of its data means, where it comes from, and what kind of transformation needs to happen throughout any data lakes or other data storage systems. If the data is not governed at the same time, the associated metadata quickly grows outdated. Data governance captures and curates the metadata from all data lakes and data storage systems while your organization transitions to a cloud-based platform.
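As a rough illustration of what capturing this metadata can look like, the Python sketch below records what each data set means, where it comes from, and which transformations it has undergone, then walks upstream to trace lineage. The `DatasetMetadata` class, the `trace_lineage` function, and the sample data set names are hypothetical and invented for this example; governance tools maintain far richer catalogs.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    name: str
    description: str  # what the data means
    source: str  # where the data comes from
    upstream: list[str] = field(default_factory=list)  # data sets it is derived from
    transformations: list[str] = field(default_factory=list)  # steps applied to it


def trace_lineage(catalog: dict[str, DatasetMetadata], name: str) -> list[str]:
    """Return the data set followed by its upstream ancestors, nearest first."""
    order = []
    seen = {name}
    queue = deque([name])
    while queue:
        current = queue.popleft()
        order.append(current)
        entry = catalog.get(current)
        if entry is None:
            continue  # ancestor not yet cataloged; lineage stops here
        for parent in entry.upstream:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return order


# Hypothetical catalog entries captured during a migration.
catalog = {
    "raw_orders": DatasetMetadata(
        "raw_orders", "Orders as exported by the order system", "on-premises HDFS"
    ),
    "clean_orders": DatasetMetadata(
        "clean_orders", "Validated, de-duplicated orders", "cloud object storage",
        upstream=["raw_orders"], transformations=["de-duplicate", "validate dates"],
    ),
    "sales_report": DatasetMetadata(
        "sales_report", "Monthly sales by region", "cloud data warehouse",
        upstream=["clean_orders"], transformations=["aggregate by month and region"],
    ),
}

lineage = trace_lineage(catalog, "sales_report")
print(lineage)
```

If entries like these are updated as each data set moves, the lineage stays current through the migration instead of having to be reconstructed afterward.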

Moving to a cloud-based data platform is a tall task, especially with so much data at stake. With a comprehensive data management infrastructure and team in place, guided by a compliance-driven data governance program, enterprises can proactively manage and mitigate data issues and resolve data problems before migrating data for use on a new and improved cloud-based data platform.

About the Author

Emily Washington is the executive vice president of product management at Infogix, where she is responsible for driving product strategy, product road maps, product marketing, and vertical solution initiatives. Since joining Infogix in 2002, Emily has worked closely with product development teams and customers to drive introduction and adoption of all new products. You can contact the author via LinkedIn.

