Vector Databases and What They Mean to Generative AI
James Kobielus, TDWI’s senior director of research for data management, discusses the importance of vector databases to recent innovations in generative AI.
- By Upside Staff
- September 29, 2023
In this recent “Speaking of Data” podcast, TDWI’s James Kobielus discusses the importance of vector databases to recent innovations in generative AI. Kobielus is senior research director for data management at TDWI. [Editor’s note: Speaker quotations have been edited for length and clarity.]
Kobielus began by explaining that although vector databases are not new technology, their particular strengths in dealing with vectorized data have made them especially useful in generative AI.
“In mathematics, a vector incorporates a position in coordinate space and a direction,” he said. “In the context of data, a vector is a way of representing multidimensional data.” Vectorized data, also called vectorized embeddings, Kobielus explained, consists of ordered arrays of numbers that record measures of proximity and similarity, as well as the data itself. “It is this similarity that really makes generative AI work.”
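To make the idea of similarity between vectors concrete, here is a minimal sketch in Python. It uses cosine similarity, one common measure that Kobielus did not name specifically, so treat the choice as an assumption. The three-dimensional vectors and their labels are invented for illustration; real embeddings usually have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, made up for this example (not model output).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten))   # close to 1: the vectors point the same way
print(cosine_similarity(cat, invoice))  # close to 0: the vectors are nearly unrelated
```

A score near 1 means two items sit close together in the embedding space, which is the sense of similarity described in the passage above.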
“You can vectorize any type of data -- text, images, media, sensor data, and so on,” he said. “Basically, data -- usually unstructured data -- is run through a neural network that creates vectorized representations of each element, looking for patterns, semantic connections, and more.” This data is then fed into the algorithms that power today’s generative AI applications.
“Vectorized data doesn’t necessarily need a vector database to handle it. Most of the primary use cases for vector databases -- natural language processing, recommendation engines, and so on -- have previously been performed by graph databases, document databases, or even key-value stores. However, vector databases are optimized to handle large amounts of vectorized data, providing much quicker query and processing response times.”
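As a rough illustration of the kind of query involved, the sketch below finds the nearest stored vectors to a query vector by scanning every item. This is the naive baseline, not how any particular vector database works; the document IDs, the two-dimensional vectors, and the choice of Euclidean distance are all assumptions made for the example.

```python
import heapq
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(query, items, k=2):
    """Return the k (id, distance) pairs closest to query by checking every item."""
    scored = (
        (item_id, euclidean_distance(query, vector))
        for item_id, vector in items.items()
    )
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Hypothetical stored vectors, invented for illustration.
items = {
    "doc-a": [0.1, 0.9],
    "doc-b": [0.8, 0.2],
    "doc-c": [0.2, 0.8],
    "doc-d": [0.9, 0.1],
}

results = nearest_neighbors([0.1, 0.95], items, k=2)
print([item_id for item_id, _ in results])
```

Because this scan touches every stored vector, its cost grows with the size of the collection. Handling that workload quickly at large scale is the optimization the quote attributes to vector databases.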
Kobielus also stressed that, with the current rate of adoption of generative AI, professionals in the field will have to quickly become familiar with the concepts and methods of vector databases.
“Vector databases won’t replace relational or other types of databases,” he said. “After all, these are most often the sources of the data to be vectorized.” However, over time, as more enterprises deploy their own generative AI programs on premises, they will need people who know how to set up and manage vector databases alongside their other data platforms.
“Vector databases don’t generally use SQL commands but rather operate with API methods or custom functions, so people will have to learn new query and programming languages.” Vectorizing data also relies on neural networks, so organizations will have to rely heavily on their data science team members who have experience in that area, Kobielus added. “It may be that your current staff will need training -- perhaps even extensive training -- to get their heads around this new technology.”
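To give a feel for what querying through API methods rather than SQL can look like, here is a toy in-memory store written in Python. The class name `TinyVectorStore` and its `upsert` and `query` methods are hypothetical and do not reproduce the API of any real product; the vectors and metadata are invented.

```python
import math


class TinyVectorStore:
    """Hypothetical in-memory store; not the API of any real vector database."""

    def __init__(self):
        self._records = {}  # record id -> (vector, metadata)

    def upsert(self, record_id, vector, metadata=None):
        """Insert a record, or overwrite it if the id already exists."""
        self._records[record_id] = (list(vector), dict(metadata or {}))

    def query(self, vector, top_k=1, where=None):
        """Return up to top_k (id, score) pairs ranked by cosine similarity.

        `where` is an optional dict of metadata values that a record must match.
        """
        where = where or {}
        hits = []
        for record_id, (stored, metadata) in self._records.items():
            if all(metadata.get(key) == value for key, value in where.items()):
                hits.append((record_id, self._cosine(vector, stored)))
        hits.sort(key=lambda hit: hit[1], reverse=True)
        return hits[:top_k]

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norms if norms else 0.0


store = TinyVectorStore()
store.upsert("faq-1", [0.9, 0.1], {"lang": "en"})
store.upsert("faq-2", [0.2, 0.8], {"lang": "en"})
store.upsert("faq-3", [0.95, 0.05], {"lang": "de"})

# Instead of a SQL statement, the caller passes a query vector, a result
# limit, and a metadata filter as method arguments.
hits = store.query([1.0, 0.0], top_k=1, where={"lang": "en"})
print(hits[0][0])
```

The point of the sketch is the shape of the interaction: records go in through a method call, and results come back ranked by similarity to a query vector rather than selected by a SQL predicate.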
Kobielus offered some tips for evaluating vector databases.
“First, make sure it’s scalable, because not all of them are.
“Next, evaluate them to see how well they leverage the skills and tools your organization has already adopted for things such as graph and document databases. You need to also see if they are deployable in a distributed environment for parallel processing. You’ll also want to see if it offers single and multinode performance.”
However, he warned, the most important thing to look at is how well a candidate product supports similarity analysis of vectorized data, given that this is the core use case for vector databases.