
Four Data Preparation Trends to Watch in 2019

From privacy to pricing, scalability to self-driving technology, 2019 will be a crucial year for data prep advancements.

As 2018 draws to a close, it is a perfect time to reflect on the trends that emerged this year and take note of what's coming in the next one. Although data preparation (DP) is not a new technology, the industry around it is evolving rapidly, and DP is now considered a vital organizational competency in a world where winners and losers are determined by the speed and quality of their data and analytical processes. Before we look at what's to come in 2019, it makes sense to look at what was accomplished this year.

2018: Recognition of Data Prep as a Key Part of a Modern Data Architecture

One of the milestones for the data prep market in 2018 was its recognition as a key component in transforming data into information on demand for analytics and other modern data initiatives. Numerous analysts now include data prep in their frameworks for modern data architecture.

Intelligent data ingestion and processing. Data prep tools got smarter. By using artificial intelligence (AI) and machine learning (ML), DP tools can dynamically read, interpret, and flatten complex data structures to streamline traditional data preparation workflows. This is critical because the number of data sources and formats feeding analytics and data science initiatives is expanding. Also notable was the explosion of exploratory versus operational analysis. Although predefined operational reports and dashboards remain, there is a massive appetite for exploratory styles of analysis where the questions are not predefined.
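To make "flattening" concrete, here is a minimal sketch using the open source pandas library. It illustrates the general technique only, not any vendor's implementation, and the sample record and field names are hypothetical.

    # Minimal sketch: flatten a nested JSON record into a flat table with pandas.
    # The sample order record and its field names are hypothetical.
    import pandas as pd

    records = [
        {
            "order_id": 1001,
            "customer": {"name": "Acme Corp", "region": "EMEA"},
            "lines": [
                {"sku": "A-100", "qty": 2},
                {"sku": "B-200", "qty": 5},
            ],
        }
    ]

    # json_normalize expands nested objects into dotted columns and
    # explodes the "lines" array into one row per line item.
    flat = pd.json_normalize(
        records,
        record_path="lines",
        meta=["order_id", ["customer", "name"], ["customer", "region"]],
    )
    print(flat)
    # Columns: sku, qty, order_id, customer.name, customer.region

A data prep tool performs this kind of transformation automatically across many formats; the point of the sketch is simply what "flattening a complex structure" produces for the analyst.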

One-step data profiling. Data prep tools increasingly served as the "first eyes on data" in data lakes and other data repositories that are poorly described or documented. This enables business consumers to find their data easily and understand what it contains and what it means, which is the first step toward better data outcomes.
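As a rough illustration of what such profiling surfaces (a hedged sketch, not any product's behavior), the following pandas snippet computes the kinds of summaries a data prep tool would show automatically: column types, null counts, and cardinality. The file name is hypothetical.

    # Minimal profiling sketch with pandas; "customers.csv" is a hypothetical file.
    import pandas as pd

    df = pd.read_csv("customers.csv")

    profile = pd.DataFrame({
        "dtype": df.dtypes.astype(str),   # inferred data type per column
        "non_null": df.notna().sum(),     # how many values are present
        "nulls": df.isna().sum(),         # how many values are missing
        "distinct": df.nunique(),         # cardinality of each column
    })
    print(profile)

    # Numeric distributions (min, max, mean, quartiles) in one call.
    print(df.describe())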

Collaborative data prep. Most earlier data prep efforts were based on individuals working with data in Excel, Access, or some other desktop tool, but 2018 marked the point where data prep became a team sport. Today, data sets and recipes are shared with others, enabling peers to collaboratively develop and review data prep projects using Google Sheets-style joint editing.

Cloud data prep takes center stage. As data prep became mission-critical, moving it to the most powerful and trusted infrastructure became a key enabler. The cloud has emerged as the go-to platform for data projects at most organizations. The data center of gravity is no longer on premises but in the cloud. More data lakes are moving to the cloud, and Snowflake and other cloud-native technologies are displacing the traditional on-premises enterprise data warehouse (EDW).

2019 and Beyond: What Lies Ahead for Data Prep

The next few years will be pivotal as data technologies mature and move massively into the cloud. Here are a few trends to keep on your radar.

The move toward consumption-based pricing models for data architectures and technologies. Pricing models come and go with industry cycles, but we are on the precipice of one of the most interesting trends in recent years -- the move to consumption-based pricing. Thanks to the cloud and companies such as Snowflake and Databricks, paying only for the computational resources you use (as on AWS) is becoming a reality. This is a great fit if you only run a few queries a week, but if your organization needs to run hundreds of data tasks a day, the business is left to absorb the mounting cost of the pay-as-you-go model. Bottom line: some normalization is coming, and the market is moving toward models that strike the right balance between fixed and variable pricing.
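A back-of-the-envelope comparison makes the trade-off concrete. The rates below are entirely hypothetical, not any vendor's pricing; they simply show how pay-per-run costs scale with workload while a fixed subscription does not.

    # Hypothetical comparison of consumption-based vs. fixed pricing.
    # All rates are made-up illustrations, not real vendor prices.
    COST_PER_RUN = 0.75         # dollars per data task under pay-as-you-go
    FIXED_MONTHLY_FEE = 3000.0  # dollars per month under a fixed subscription

    def monthly_consumption_cost(runs_per_day, days=30):
        """Total pay-as-you-go cost for a month of data tasks."""
        return runs_per_day * days * COST_PER_RUN

    for runs_per_day in (5, 50, 500):
        variable = monthly_consumption_cost(runs_per_day)
        cheaper = "pay-as-you-go" if variable < FIXED_MONTHLY_FEE else "fixed"
        print(f"{runs_per_day:>4} runs/day -> ${variable:>9,.2f} variable vs. "
              f"${FIXED_MONTHLY_FEE:,.2f} fixed ({cheaper} wins)")

Under these assumed rates, a handful of runs a day strongly favors consumption pricing, while hundreds of runs a day make the fixed fee the better deal -- which is exactly why hybrid models are likely to emerge.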

Containerization and adoption of Kubernetes to deliver massive scale-out, elasticity, and manageability. Today's cloud systems come in a variety of delivery models, including infrastructure-as-a-service (IaaS), platform-as-a-service (PaaS), and software-as-a-service (SaaS). However, each approach requires complex cluster management, and 2019 will usher in the age of serverless, low-touch administration and management. In fact, cloud vendors such as Google and AWS are already setting the example for how users can consume technology without having to own the complex cluster management previously required. Container technology such as Kubernetes enables vendors to provide managed services to the end user where all of the configuration is packaged and managed in a simple way. Bottom line: all parties win because vendors no longer need to provide continuous hands-on support, and customers can get the value of the service without having to keep it up and running.

Self-Service 2.0 -- the shift to self-driving technology. Just as self-service applications are all the rage, the industry is preparing for their next incarnation: self-driving technology, where machines take the lead and humans validate the results and teach the machines the exceptions. Today, users benefit from machine learning but still do most of the heavy lifting when it comes to manipulating data. However, now that organizations are deriving value from ML, it's time for a role reversal. Instead of humans doing 80 percent of the work and machines handling 20 percent, companies such as DataRobot are setting a new standard with self-driving paradigms in which machines assume 80 percent of the heavy lifting, leaving humans more time for thoughtful analysis. Bottom line: self-driving technology stands to improve productivity and broaden the types of users who can interact with data on their own.

Customer privacy in the new age of data democracy. Just as in politics, democracy does not exist without rules and constitutions. Similarly, data democracy does not exist without guidelines covering governance, security, data lineage, and collaboration. Data is at the center of almost every strategy, including customer experience, product development, and the optimization of business processes. To support these initiatives, organizations are democratizing access to information, making it available to more business users. This can yield impressive operational results, but it also means more copies of data are being made and more people have access to it; with the advent of GDPR and other consumer privacy regulations, the risks for organizations are increasing.

As users across the enterprise ask for data on customers, purchase patterns, or renewal patterns, organizations must establish standards that balance self-service and democratization with governance, security, and the enterprisewide ability to track data lineage and usage. Bottom line: chief data officers, chief analytics officers, and IT leaders must bring balance to the enterprise and drive the adoption of standardized data technologies and approaches.
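As one hedged illustration of that balance (a sketch, not a compliance recipe), the snippet below pseudonymizes a direct customer identifier before a data set is shared for self-service analysis. The column names and salt value are hypothetical, and hashing alone is not a full GDPR strategy.

    # Hypothetical sketch: pseudonymize a customer identifier before sharing data
    # for self-service analysis. Column names and the salt are made up.
    import hashlib
    import pandas as pd

    SALT = "replace-with-a-secret-salt"

    def pseudonymize(value: str) -> str:
        """Return a stable, non-reversible token for a customer identifier."""
        return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()[:16]

    customers = pd.DataFrame({
        "email": ["ann@example.com", "bob@example.com"],
        "region": ["EMEA", "APAC"],
        "renewals": [3, 1],
    })

    shared = customers.copy()
    shared["customer_key"] = shared["email"].map(pseudonymize)
    shared = shared.drop(columns=["email"])  # drop the direct identifier
    print(shared)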

About the Author

Piet Loubser is senior vice president, global head of marketing at Paxata, a pioneer and leader in enterprise-grade self-service data preparation for analytics.

