Next Year in Data Analytics: Data Quality, AI Advances, Improved Self-Service
Three interesting analytics trends to watch that could improve the use of data throughout your enterprise.
- By Mike Loukides
- December 9, 2022
Back in 2005, Tim O’Reilly said, “Data is the next Intel Inside.” He was right. We no longer talk about “Intel Inside,” and the Intel Inside stickers have all disappeared. However, we do talk about data -- a lot. Here are three trends in data and data analysis that will be worth watching in the coming year.
Trend #1: Being data-centric
A few years ago, we said “more data trumps better algorithms.” Now, we’re finding out that better data trumps more data. Data quality is an important component of technical debt. More data can introduce problems rather than solve them. Simply increasing the size of a data set can lead to unexpected behaviors, some of which may be correct, many of which are wrong, and none of which are predictable beforehand.
Getting better data for training AI applications means paying more attention to tagging training data. That can mean many things, including tagging the first thousand or so items yourself so that you understand the problems in the data set and can do a better job of instructing others how to tag it. It might mean hiring a small number of people to do the tagging and paying them a living wage. Many AI developers rely on crowdsourcing services instead, but those services incentivize workers to move fast rather than to tag carefully. Becoming data-centric might even mean using AI-based tagging solutions, which, when done correctly, can be less error-prone than human tagging.
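One way to keep tagging quality visible is to have two annotators label the same sample and measure how often they agree beyond chance. Here is a minimal sketch in plain Python using Cohen's kappa; the labels and the `cohens_kappa` helper are invented for illustration, not taken from any particular tool:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators chose the same tag.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Expected agreement: probability both would pick the same tag by chance.
    expected = sum(counts_a[lab] * counts_b[lab] for lab in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two annotators tag the same ten items:
a = ["cat", "dog", "cat", "cat", "dog", "cat", "dog", "dog", "cat", "dog"]
b = ["cat", "dog", "cat", "dog", "dog", "cat", "dog", "cat", "cat", "dog"]
print(round(cohens_kappa(a, b), 2))  # 0.6
```

A kappa near 1.0 means the annotators genuinely agree; a value near 0 means the tagging instructions probably need work before scaling up.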
Being data-centric also means ensuring that developers know where their data comes from, know that it was collected ethically, and understand possible sources of error. Finally, being data-centric means charging developers with creating documentation for data sets, along the lines of Timnit Gebru’s groundbreaking paper, Datasheets for Datasets.
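In practice, such documentation can be as lightweight as a machine-readable record attached to each data set. Here is a sketch of what that might look like; the field names and example values are illustrative, loosely in the spirit of the datasheet idea rather than taken from Gebru's paper:

```python
from dataclasses import dataclass, field

@dataclass
class Datasheet:
    # Illustrative fields for documenting a data set.
    name: str
    provenance: str          # where the data came from
    collection_method: str   # how it was gathered
    known_limitations: list = field(default_factory=list)
    intended_uses: list = field(default_factory=list)

sheet = Datasheet(
    name="support-tickets-2022",
    provenance="exported from internal helpdesk system",
    collection_method="all tickets closed in 2022; no sampling",
    known_limitations=["English-only", "excludes deleted tickets"],
    intended_uses=["tagging-model training", "volume forecasting"],
)
print(sheet.name)
```

The point is less the data structure than the discipline: a developer picking up this data set can see at a glance where it came from and what it should not be used for.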
Whether you’re performing business analytics, making customer recommendations, or managing supply chains, becoming data-centric will make your results more accurate. That’s true whether you’re using machine learning or more traditional statistical applications. We’ve said “garbage in, garbage out” for years. It’s time we took that seriously.
Trend #2: Paying attention
Everyone in technology must be aware of the tremendous advances made in natural language processing over the last two years. NLP isn’t directly related to data analytics, but I want to go out on a limb and suggest an important direction for future work. Models such as GPT-3 deliver great results because of the use of transformers, a new kind of AI algorithm. Without going into technical detail, transformers implement a kind of “attention” called self-attention: they are able to determine what parts of a text are important based on context, not just word frequency.
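The core idea of self-attention can be sketched in a few lines. This is a deliberately stripped-down, single-head version with no learned weight matrices (real transformers project inputs into separate query, key, and value vectors); it shows only how each token's output becomes a context-weighted mix of all the inputs:

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over a sequence of embeddings.
    Single head, no learned Q/K/V projections -- illustration only."""
    d = x.shape[-1]
    # How strongly each position attends to every other position.
    scores = x @ x.T / np.sqrt(d)
    # Softmax over each row turns scores into attention weights summing to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted average of the inputs -- context decides the mix.
    return weights @ x

# Three "tokens" with 4-dimensional embeddings (made-up numbers):
x = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [1.0, 0.1, 0.0, 0.0]])
out = self_attention(x)
print(out.shape)  # (3, 4)
```

Because the weights are computed from the inputs themselves, similar tokens attend to each other more strongly -- that is the "deciding what's important based on context" described above.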
Although transformers aren’t widely used outside of NLP (they are starting to appear in computer vision, with encouraging results), I wonder what implications attention might have for business modeling and forecasting. We are increasingly relying on AI to build financial models. How powerful would those models be if they had a concept of “attention” -- if they could decide, based on context, what data was worth paying attention to? What if the input to the model consisted of all the company’s financial data, local and worldwide economic data, historical data, and current news, along with possible scenarios for going forward? Such a system, equipped with the ability to determine what is important and what is noise, would be able to outperform our current financial models.
Will self-attention be applied to financial analysis, supply chain predictions, inventory management, and other business problems? This is definitely a risky prediction. Academic research tends to focus on problems such as natural language and computer vision, not business problems, but it is hard to believe that businesses wanting a competitive advantage will ignore the success of transformers in the academic world. It is equally hard to believe that existing SaaS platforms won’t see this as an opportunity to extend their product offerings.
Trend #3: Self-service data
Despite a lot of talk, self-service data is still in the early days. A few things are holding it back. First, data is still often held in silos: different repositories owned by different constituencies within a company, designed without any thought for compatibility or even use by other parts of the organization. At worst, silos are tied up with an organization’s political infighting. Nothing about data silos is conducive to self-service data. At the other extreme, some organizations have broken down their silos, replacing them with a data lake (or lakehouse, or warehouse, or some other house-y or watery metaphor). No more silos, but you still don’t have self-service data -- you have a lot of undigested data that is often unstructured, dumped into a mass storage system with minimal thought about how people are going to use it.
Data meshes are part of the solution to this problem. Data meshes allow groups within an organization to be responsible for their own data -- after all, they understand the data -- while making it available to the rest of the organization. Another key part of self-service data is a data catalog: a company-wide directory that lets users discover what data exists and shows what it has been used for, who is responsible for it, and other metadata. Good data governance also makes self-service easier because it forces you to document the data you have: where it came from (provenance), how it was collected, restrictions on its use, and other metadata. Good governance also entails taking responsibility for what users do with data. Self-service users can’t take a Wild West approach where anything goes.
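To make the catalog idea concrete, here is a toy sketch of what discovery might feel like to a self-service user; the data set names, owners, and tags are invented, and a real catalog would be a product or service, not an in-memory list:

```python
# A toy in-memory data catalog. Entries record the metadata described above:
# what exists, who is responsible for it, and where it came from.
catalog = [
    {"name": "sales.orders", "owner": "revenue-team",
     "tags": ["orders", "finance"], "provenance": "order-entry system"},
    {"name": "ops.shipments", "owner": "logistics-team",
     "tags": ["supply-chain"], "provenance": "carrier EDI feeds"},
]

def discover(tag):
    """Let a self-service user find data sets by tag and see who owns them."""
    return [(d["name"], d["owner"]) for d in catalog if tag in d["tags"]]

print(discover("finance"))  # [('sales.orders', 'revenue-team')]
```

Even this toy version shows the governance payoff: a user never has to guess whether relevant data exists or whom to ask about it.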
One element is still missing: self-service data requires widespread data literacy. The shortages of data scientists, data engineers, and AI experts are real issues. Democratizing data also means that the people using data must understand how to use data properly. We have low-code and no-code tools that do a good job of building simple applications and doing basic analytics. However, the people who use these tools must be able to answer basic questions about when data is meaningful, how confident they are in the results, and whether the original data was gathered in a way that didn’t introduce errors and biases. There’s no shortcut around data literacy.
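One of those basic questions -- how confident should I be in this number? -- can be made concrete even without heavy tooling. Here is a sketch of a percentile-bootstrap confidence interval for a mean, using made-up conversion-rate data and only the standard library:

```python
import random

def bootstrap_ci(data, n_resamples=10_000, alpha=0.05):
    """Percentile-bootstrap confidence interval for the mean of `data`."""
    random.seed(0)  # fixed seed so the example is reproducible
    # Resample with replacement many times and collect the resampled means.
    means = sorted(
        sum(random.choices(data, k=len(data))) / len(data)
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Made-up daily conversion rates:
rates = [0.021, 0.034, 0.025, 0.040, 0.019, 0.030, 0.027, 0.033]
lo, hi = bootstrap_ci(rates)
print(f"mean is likely between {lo:.3f} and {hi:.3f}")
```

A data-literate user who sees how wide that interval is on eight days of data will think twice before treating a single day's number as a trend.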
Those are O’Reilly Media’s three trends to watch. One may be a long shot -- but an important long shot. If you’re right about all your predictions, you’re not predicting. The long shots are the most interesting -- and if they win, the most important.
Mike Loukides is vice president of content strategy for O'Reilly Media, Inc. He's edited many highly regarded books on technical subjects that don't involve Windows programming. He's particularly interested in programming languages, Unix, and what passes for Unix these days, AI, and system and network administration. Mike is the author of System Performance Tuning and a coauthor of Unix Power Tools and Ethics and Data Science. Most recently he's been writing about data and artificial intelligence, ethics, and the future of programming. Mike can be reached on Twitter and LinkedIn.