Best Practices to Ensure Data Transformation Success
Prophecy’s founder and CEO Raj Bains explains how organizations can overcome the challenges to achieving enterprise-wide data transformation.
- By Upside Staff
- June 24, 2024
Upside: How does generative AI change how people can work with data? How does it change how they should work with data?
Raj Bains: Generative AI is fundamentally transforming how people work with data, especially in business intelligence and data transformation. Data transformation, traditionally time-consuming and resource-intensive, is set for significant change as technologies that combine intelligence about data with language models offer new approaches to transforming, organizing, and deriving insights from data.

The industry’s current approach to data transformation is not measuring up. Simplistic tools that cater to many users lack power, and cloud data platforms are powerful but remain accessible only to expert data engineers. Both approaches fail to adequately address the problem.
The data transformation process, essentially a narrow form of programming that is rich in context, is well-suited to AI-powered copilots that cater to users at various skill levels and integrate with visual data pipelines. Without guidance from AI-powered visual copilots, some organizations struggle to know what to do next, such as which tables to join or how target columns should be computed from source columns.
Developing the transformations themselves can also be daunting because the necessary coding often requires expertise in PySpark or Scala. Making changes is equally difficult because it requires understanding the existing code. (This could be alleviated if users could make changes within the visual interface.) Alarmingly, documentation, explanations, tests, and commit messages are often completed as an afterthought, if at all.
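As a rough illustration of the coding involved, here is a minimal PySpark sketch of a typical transformation, joining two sources and deriving a target column; the table and column names are hypothetical:

```python
# Minimal PySpark sketch (hypothetical table and column names): join two
# sources and compute a target column from source columns.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example").getOrCreate()

orders = spark.read.table("orders")        # hypothetical source table
customers = spark.read.table("customers")  # hypothetical source table

# Join on a shared key, then derive the target column.
enriched = (
    orders.join(customers, on="customer_id", how="left")
          .withColumn("net_revenue", F.col("gross_amount") - F.col("discount"))
)
enriched.write.mode("overwrite").saveAsTable("enriched_orders")
```

Writing a few lines such as these is routine for a data engineer but a real barrier for a business analyst, which is the gap copilots aim to close.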
There are several concerns, such as hallucinations and bias, about using generative AI for production work. When it comes to data transformation, how do you know generative AI is doing an accurate job? Is the technology truly ready for prime time?
Generative AI is ready for production work, not as the primary developer, but as a copilot assisting the primary data user. When developing a visual pipeline for data transformation, the copilot suggests transformations that the user can inspect, showing the resulting data after each step.
The visual interface allows easy final edits to ensure accuracy, and the copilot can identify similar values computed elsewhere with subtle differences, preventing errors. It also generates documentation, explanations, and tests to verify that the output matches user intent and maintains data quality.
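As a hedged illustration of such generated tests, a data-quality check for the derived table sketched earlier might look like this (names remain hypothetical):

```python
# Hypothetical generated data-quality test for the enriched_orders table:
# verify the derived output before publishing it.
from pyspark.sql import functions as F

def test_enriched_orders(enriched, orders):
    # The left join must neither drop nor multiply order rows.
    assert enriched.count() == orders.count()
    # The derived column must never be null.
    assert enriched.filter(F.col("net_revenue").isNull()).count() == 0
```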
By combining a visual interface with generative AI, copilots will significantly boost productivity without sacrificing quality.
Data platforms use generative AI to simplify a number of processes, but organizations are still struggling to get data into the hands of those who need it most. Why?
Generative AI and large language models (LLMs) initially appeared as tools that generate text or code from prompts based on publicly available data. This presents two challenges: products such as ChatGPT lack specific organizational context, and tools such as GitHub Copilot, which generate code from prompts, are usable only by expert coders.
We'll soon see products that are more intuitive, require fewer tedious prompts, and are better integrated into user-friendly interfaces. These tools will be deployed within organizations, learning their context and becoming more useful. Early technologies such as retrieval-augmented generation (RAG) are steps in this direction. Given a few months to mature, these products will start delivering real value.
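For context, retrieval-augmented generation grounds a model in an organization’s own knowledge by fetching relevant documents and supplying them alongside the user’s request. The following dependency-free Python sketch illustrates the pattern; the scoring and prompt-building functions are simplified stand-ins, not any product’s API:

```python
# Minimal sketch of the RAG pattern: retrieve the most relevant organizational
# context, then build a prompt that carries it to the language model.
def score(query: str, doc: str) -> int:
    # Toy relevance score by shared words; a real system would use embeddings.
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}"

# Usage: the prompt now carries organization-specific context a base model lacks.
docs = [
    "Table orders holds one row per order with gross_amount and discount.",
    "Table customers maps customer_id to region and segment.",
]
print(build_prompt("How do I compute net revenue per order?", docs))
```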
What should organizations do to address the unique data transformation needs of different types of users?
Organizations need to enable different data users to be productive in their roles. At one end are data engineers who set up data platforms, processes, standards, and frameworks, and build central pipelines for large data sets, ensuring performance and cost efficiency. At the other end are business data users, including data analysts and data scientists, who focus on business problems. They need an easy-to-use visual interface that enables them to build daily pipelines independently of the data platform teams.
With data and analytics centrally important to all types of users, the key questions are how generative AI will change the way we all work with data and what is needed to ensure success. For decades, we’ve known that obtaining clean, high-quality, and timely data poses one of the greatest challenges for enterprises. This challenge is especially critical as enterprises seek to capitalize on the promise of AI.
Most business teams feel starved for data, while central data platform teams are overwhelmed and can only deliver a fraction of what is needed, consuming excessive resources in the process.
Copilots improve the accessibility and availability of data for technical and non-technical users throughout the enterprise, democratizing data and analytics while ensuring the delivery of clean, trusted, and timely data needed for analytics. They also help data users increase their productivity. The key is to meet the needs of all users on the same platform by allowing data platform teams to assist business users and ensure everyone follows best practices and frameworks. Future copilots will be ubiquitous, accommodating all skill levels without compromising the platform's power.
Well-known copilots are producing impressive results, improving productivity. How should copilots be used for data transformation?
Copilots are showing impressive results, especially in programming, as seen with GitHub Copilot's rapid adoption and productivity improvements. These benefits will extend to data transformation with specialized copilots as well.
Data transformation copilots must be:
- Integrated and comprehensive. Copilots must work with existing data platforms and support the entire data transformation life cycle.
- Intuitive and intelligent. They must provide a visual, integrated interface for data analysts and a code interface for data platform users. By serving both groups, generative AI can handle half the work of developing, deploying, and observing pipelines, boosting productivity.
- Open and extensible. These visual interfaces should produce Spark or SQL code and enable data engineers to create standards and frameworks for all users.
With these capabilities, copilots can help data analysts and visual ETL developers create data pipelines, with AI doing half the work, drastically improving productivity. Data platform teams can code standards and frameworks and make them available as visual components. Most important, a single platform for all users reduces costs, increases productivity, and ensures higher data quality.
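To make that last point concrete, here is a hedged sketch of the kind of standard a data platform team might codify and expose as a reusable visual component; the function and its rules are hypothetical:

```python
# Hypothetical team standard a platform team could publish as a visual
# component: keep the most recent row per key, then trim all string columns.
from pyspark.sql import DataFrame, Window
from pyspark.sql import functions as F

def apply_team_standard(df: DataFrame, key: str, ts: str) -> DataFrame:
    # Retain only the latest row per key, ordered by the timestamp column.
    w = Window.partitionBy(key).orderBy(F.col(ts).desc())
    latest = (
        df.withColumn("_rn", F.row_number().over(w))
          .filter(F.col("_rn") == 1)
          .drop("_rn")
    )
    # Normalize string columns by trimming whitespace.
    for name, dtype in latest.dtypes:
        if dtype == "string":
            latest = latest.withColumn(name, F.trim(F.col(name)))
    return latest
```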
[Editor’s note: Raj Bains is the founder and CEO of Prophecy, a data copilot company that enables businesses to accelerate AI and analytics by delivering data that is clean, trusted, and timely. Previously, Raj led project management of Apache Hive at Hortonworks through its IPO and headed product management and marketing for a NewSQL database startup. His engineering roles include developing a NewSQL database, building a CUDA compiler at NVIDIA as a founding engineer, and working as a compiler engineer on Microsoft Visual Studio.]