
New York Times Versus Microsoft: The Legal Status of Your AI Training Set

Watch out! Your AI solution should guarantee that you won’t be held responsible if legal claims are made about its training data.

As 2023 ended, The New York Times filed suit against Microsoft and OpenAI, alleging that the Times’ intellectual property was used without permission to train ChatGPT. I doubt this litigation will be fully successful because it seeks to eliminate the related training sets, but it will undoubtedly clarify the rights associated with using internet-sourced information to train AI.


Given that The New York Times asserts its archive is one of the largest repositories of training data, and given the amount of money focused on developing AI, I believe it is at least likely not only that the rules surrounding that information will become more tightly defined but also that the Times could be sold to a company that values this information but not the Times as a news service. With the Times valued at around $8 billion, and with the litigation potentially having a significant adverse impact on OpenAI’s current $86 billion valuation, it would seem a no-brainer, should The New York Times have a valid complaint, for the defendants simply to buy the company and shut down the related litigation.

Let’s talk about what this means for generative AI in 2024, why you will need to verify the legal status of your AI training sets, and why you will need to put stronger guardrails on how the technology is used.

Indemnifying Generative AI as Protection Against the Litigation Magnet

Microsoft can easily afford this litigation. It ranks among the most valuable companies in the world and has one of the strongest legal teams in tech focused on protecting its intellectual property, including defending against claims that it has illicitly used someone else’s technology. This means that both Microsoft and OpenAI, which Microsoft backs, are set up to fund, staff, and execute litigation at this level.

This litigation is likely to be extremely expensive. If the Times wins, the result will have a large adverse material impact on Microsoft and OpenAI. Should Microsoft and OpenAI prevail, the loss will likely crater The New York Times’ valuation, though there is a decent chance the Times sees this as an opportunity to sell the business to Microsoft at a premium, which, depending on how the litigation goes, could be one of Microsoft’s less costly paths to settlement.

In addition, should The New York Times win or settle favorably, training sets like these will become litigation magnets. Not only is the Times likely to go after other tech companies using its data without permission, it is also likely to use the resulting precedents and settled law to go after those companies’ customers -- the users of the resulting AIs. Furthermore, the Times will not be alone. Any entity that believes its intellectual property was used without permission for training, and that can provide basic proof of it, could do the same, opening a Pandora’s box of litigation.

As a result, the training-set indemnification that some, but not all, AI vendors provide just became a critical aspect of any AI solution based on externally sourced data that you are likely to want to deploy. Most companies are not set up for litigation at this level, so partnering with a vendor that can indemnify you just became a major part of your critical path to AI deployment success.

IP Contamination

One of the problems the tech industry has had from the start is product contamination by intellectual property taken from a competitor. The tech industry is not alone; the problem of one company illicitly acquiring another’s intellectual property and then getting caught goes back decades.

If an engineer uses generative AI whose training set is contaminated by a competitor’s intellectual property, there is a decent chance, should that competitor find out, that the resulting product will be found to be infringing and blocked from sale -- with the company that made use of that AI potentially facing severe fines and sanctions, depending on the court’s ruling.

Wrapping Up

A primary requirement for any AI solution, from any vendor, should be that it comes with indemnification for the use of its training set or is constrained to data sets vetted as fully under your or your vendor’s legal control. (Be aware that if you provide AI capabilities to others, an increasing number of your customers will demand indemnification.) You’ll also need to ensure that the indemnification is adequate to your needs and that the data sets won’t compromise your products or services, whether under development or in market, so your revenue stream isn’t put at risk.

This is going to be an exciting year for AI. Just make sure you are doing what you can to keep that excitement from putting your company at risk. When AI is used for product development, ensure that the training set has been fully vetted so it doesn’t compromise the resulting products, and avoid solutions that don’t come with indemnification (especially any that require you to indemnify the provider).

About the Author

Rob Enderle is the president and principal analyst at the Enderle Group, where he provides regional and global companies with guidance on how to create a credible dialogue with the market, target customer needs, create new business opportunities, anticipate technology changes, select vendors and products, and practice zero-dollar marketing. You can reach the author via email.

