
Picking an AI Vendor: Look for Indemnification and Experience

AI is full of promise, but also pitfalls. Heed this advice about AI indemnification and training data set best practices to help keep your enterprise and your career safe.

When evaluating an AI solution, the most critical element is the data set used for training, because creating your own can cost more than the AI hardware you purchased. However, the methods used to train new large language models (LLMs) can create some unusual legal challenges. If the training data set includes “The Pile” -- a massive 800 GB collection of diverse text data for LLMs -- you may be exposed. Furthermore, if you aren’t careful, unauthorized data in the training set could result in litigation over copyright infringement or data theft.

Indemnifications

There are typically two kinds of indemnification when it comes to AI. With the first, the vendor indemnifies you. With the second, the vendor demands that you indemnify them -- avoid these vendors unless you are certain their AI training data set is clean and won’t result in litigation. When a vendor indemnifies you for the use of its training set, it means the vendor has vetted that set and is confident no significant liability exists.


Because these data sets are very large, you are unlikely to be able to assess them adequately yourself. That is why you should avoid indemnifying others and should require your chosen vendor to indemnify you. I’m not an attorney, but I recommend that when you see an indemnification clause, you make sure it protects you and doesn’t expose your firm to avoidable litigation. If you sign an indemnification agreement without legal consultation and review, you are putting your career at risk.

Training Set Approach

Companies have been running trials of these new AI solutions for several months now, and some have begun putting them into production. Vendors report that the current recommended enterprise practice is to first create a proof of concept that you don’t deploy. Build it on a large language model that can perform a wide variety of tasks to validate that your concept is feasible. Once you are comfortable with the result, create a far smaller model focused on the specific task you need to complete.
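To make that concrete, here is a minimal sketch of what the proof-of-concept stage might look like in Python using the open source Hugging Face transformers library. The model name, the prompt, and the summarization task are illustrative assumptions, not a recommendation of any particular vendor or model.

# Proof-of-concept sketch: probe a broadly capable LLM to see whether it
# can handle the task at all before investing in a smaller, focused model.
# Requires: pip install transformers torch
from transformers import pipeline

# Any general-purpose instruction-tuned model will do for a feasibility
# check; this model name is an illustrative assumption.
generator = pipeline("text-generation",
                     model="mistralai/Mistral-7B-Instruct-v0.2")

prompt = (
    "Summarize the following support ticket in one sentence:\n"
    "Customer reports that invoices exported to CSV lose all line-item detail."
)

result = generator(prompt, max_new_tokens=60, do_sample=False)
print(result[0]["generated_text"])

If the general model handles the task acceptably, the concept is validated and you can move on to building the smaller model.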

Large language models require a massive amount of processing power for a typical task, which makes the cost of each transaction or interaction unsustainable at scale. Creating a smaller, more efficient, focused model reduces that cost overhead, should increase overall performance, and is more sustainable because it wastes less energy. These smaller models are built from your firm’s own data or from a source you have vetted, so that using the model in production doesn’t create intellectual property litigation exposure.
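Here is a minimal sketch of that second stage, again in Python with Hugging Face transformers: fine-tuning a small model on your firm’s own vetted data. The file name, the column names (“text” and “label”), the label count, and the ticket-routing task are all hypothetical, chosen only to illustrate the pattern.

# Fine-tuning sketch: a small, focused model trained on your own vetted data.
# Requires: pip install transformers datasets torch
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical in-house data set with a "text" column and an integer
# "label" column -- data you own or have vetted for IP exposure.
dataset = load_dataset("csv", data_files="vetted_tickets.csv")["train"]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

# A small encoder model is far cheaper per transaction than a general LLM.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ticket-router",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("ticket-router")

The resulting model handles only the one task it was trained for, which is exactly the point: per-transaction cost drops, and every training example came from a source you control or have vetted.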

The Litigation Risk

The cases challenging the training of large language models were mostly filed recently, so there is little case law to consult. I doubt many of those cases will be successful. After all, people also learn from what they observe, and a broad finding of liability for any proprietary data used to train AIs could open the door to similar claims against humans, who generally also learn by observation.

I expect it will be years before this is all worked out and the extent of the exposures and liabilities is known. Meanwhile, it would be prudent to be conservative about using third-party data for training, or about using any LLM trained with third-party data, until we have enough case law to understand the related exposure.

A Final Word

Choosing a vendor has less to do with the hardware and software than with the data sets the vendor uses for training. Favor vendors that offer strong indemnification clauses to protect you, don’t require you to indemnify them, and have the financial resources to provide the legal protection you’d expect if your firm is sued.

Don’t skip the legal review and approval step, either. Not only would related litigation be problematic given how early we are in this technology’s life cycle, it would also likely be public, which could significantly damage your reputation and your firm’s.

About the Author

Rob Enderle is the president and principal analyst at the Enderle Group, where he provides regional and global companies with guidance on how to create a credible dialogue with the market, target customer needs, create new business opportunities, anticipate technology changes, select vendors and products, and practice zero-dollar marketing. You can reach the author via email.

