In a legal clash with far-reaching consequences for AI model training, The New York Times (NYT) has taken on OpenAI in a precedent-setting copyright infringement lawsuit. Filed in the Federal District Court in Manhattan on December 27, 2023, the case stands to define how AI developers may train large language models (LLMs) and what new copyright risks they face in doing so.
The Allegations
The crux of the lawsuit is OpenAI’s alleged unauthorized use of millions of NYT articles to train its generative AI models and tools. According to the NYT, OpenAI trained ChatGPT on NYT content yet presents query results as though ChatGPT alone generated them, which the NYT characterizes as a misrepresentation of authorship. The NYT contends that this constitutes copyright infringement, as OpenAI allegedly copied and used its content without proper authorization.
Reasons for the Lawsuit
- Training Data Ethics: The lawsuit underscores the ethical complexities of using copyrighted material as training data for AI models. While AI developers innovate, they must navigate the legal boundaries of fair use and intellectual property rights. The NYT’s legal action serves as a wake-up call for AI companies to tread carefully when sourcing training data.
- Attribution and Accuracy: OpenAI’s models, including ChatGPT, have gained widespread adoption. However, the lawsuit highlights the importance of accurate attribution. When AI-generated content references or displays journalistic work, it must do so faithfully. Misattribution can harm reputations and mislead audiences, emphasizing the need for robust verification mechanisms.
- Commercialization and Responsibility: As AI technologies become more commercially viable, developers must recognize their responsibility. The lawsuit prompts AI companies to consider the impact of their products on society. An AI system’s commercial success and reach have to be balanced against the rights of the original human content creators who provided the data used to train the model.
Interpreting Fair Use For AI Model Development
Fair use of copyrighted material by AI developers hinges on four factors:
- The purpose and character of the use: If the material remains largely unchanged, AI developers may be infringing copyright. If the developers transform the base material or data significantly, the use is far more likely to qualify as fair.
- The nature of the original copyrighted material: Artwork and creative content of all forms receive tighter copyright protection than more fact-based or prosaic material.
- The amount of copyrighted material the AI model used: Naturally, copying a work in its entirety weighs heavily toward infringement of the creator’s rights, while using only a small portion of the material weighs in favor of fair use.
- The effect of the use on the potential market value of the copyrighted material: This is perhaps the central issue in NYT v. OpenAI. The NYT is justifiably concerned that OpenAI’s use of decades of content, created at nearly incalculable cost in money and human talent, will substantially diminish the value of the NYT’s business.
Legal Experts Focus On AI Business Impact and Transformative Use
According to law professor James Grimmelmann of Cornell University: “OpenAI could argue that using the NYT article was transformative, creating something new and different from the original work. However, the NYT could argue that using such a large amount of material harms their market for licensing content.”
Some legal experts lean toward OpenAI having a viable fair use defense. Pamela Samuelson, law professor at the University of California, Berkeley, observes: “The case will likely turn on the first factor, the purpose and character of the use. If OpenAI can demonstrate that GPT-3 has a transformative purpose, such as advancing scientific research, it could have a stronger fair use claim.”
Prior Cases For AI Model Training
Cases with parallels to NYT v. OpenAI suggest that transformative-use arguments have a higher chance of defeating infringement claims, while demonstrated commercial harm strengthens copyright claims.
In Authors Guild v. Google, a class action filed in 2005, the plaintiffs claimed Google violated their copyrights by scanning millions of books for its library project. The outcome rested on the transformative nature of Google’s use: because Google converted the material into a searchable format and the impact on the market for the individual books appeared minimal, the courts did not hold Google liable.
The courts ruled differently in American Geophysical Union v. Texaco (1994). Texaco researchers copied scientific journal articles for the company’s commercial research, and the court held that the copying harmed the publishers’ licensing market for those articles and was not fair use.
An ongoing case brought by developers over GitHub Copilot highlights the risk of training AI on publicly available code without attribution. The plaintiffs claim the AI developers violated open-source licensing terms.
General Lessons for AI Developers
- Legal Vigilance: AI developers must proactively address legal implications. Understand copyright laws, fair use, and licensing agreements. Seek legal counsel to ensure compliance when using external data sources.
- Transparency and Attribution: Transparency matters. Clearly disclose when AI-generated content is not human-authored. Properly attribute sources, especially when referencing copyrighted material. Accuracy and honesty build trust.
- Ethical Training Data: Choose training data wisely. Respect copyright and privacy rights. Develop AI models with a strong ethical foundation, considering societal impact.
- Employ Control Systems And Third Party Monitors: An unbiased and specialized third party AI monitoring tool, such as Data Science Group’s system, can help mitigate the legal and performance risk factors of AI before any issues impact business viability.
Concrete Steps For AI Developers To Avoid Copyright Issues
Unless you, as an AI developer, obtain explicit consent from the verified owner of the data for your intended use, you potentially expose your company to future lawsuits and cease-and-desist orders. Applying the following practices in your model development can limit your business risks:
- Use Synthetic Data For Model Training
You can apply data augmentation, Generative Adversarial Networks (GANs), and Variational Autoencoders (VAEs) to produce a facsimile of a real-world dataset.
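As an illustration, here is a minimal sketch of the idea in Python. It uses a fitted multivariate Gaussian as a lightweight stand-in for a GAN or VAE; the function name and the toy two-feature dataset are invented for this example:

```python
import numpy as np

def synthesize_tabular(real: np.ndarray, n_samples: int, seed: int = 0) -> np.ndarray:
    """Draw synthetic rows from a multivariate Gaussian fitted to real data.

    A deliberately simple stand-in for heavier generators such as GANs or
    VAEs: the synthetic rows mimic the real data's column means and
    covariances without reproducing any original record verbatim.
    """
    rng = np.random.default_rng(seed)
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n_samples)

# Usage: 200 "real" rows with two correlated features -> 1000 synthetic rows
# that preserve the statistical shape of the original.
rng = np.random.default_rng(42)
real = rng.multivariate_normal([0.0, 5.0], [[1.0, 0.8], [0.8, 2.0]], size=200)
fake = synthesize_tabular(real, n_samples=1000)
```

The same trade-off applies to real generators: the closer the synthetic distribution tracks the source data, the more useful it is for training, but the less it can memorize any individual record.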
Even though developing synthetic data creates a new project for AI developers, it has proven effective: Waymo trains its self-driving taxis on a hybrid of synthetic and real data with technically satisfactory results.
- Apply Advanced Attribution Systems
This approach is becoming common. Enhance your AI model with detailed attributions that flow through to the end user. Methods such as Explainable AI (XAI), watermarking, and Web3-based provenance records that tie copyrighted data to its origin can substantially reduce litigation risk.
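A minimal sketch of a watermark-style attribution record in Python; the `attach_attribution` helper, its source-identifier format, and the SHA-256 digest scheme are illustrative assumptions, not any production attribution API:

```python
import hashlib
import json
from datetime import datetime, timezone

def attach_attribution(generated_text: str, sources: list[str]) -> dict:
    """Wrap model output with a tamper-evident attribution record.

    'sources' is a hypothetical list of licensed-source identifiers that the
    generation pipeline is assumed to track; the digest lets a downstream
    consumer detect modification of either the text or its source list.
    """
    record = {
        "text": generated_text,
        "sources": sorted(sources),
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }
    payload = json.dumps(
        {"text": record["text"], "sources": record["sources"]}, sort_keys=True
    ).encode()
    record["digest"] = hashlib.sha256(payload).hexdigest()
    return record

def verify_attribution(record: dict) -> bool:
    """Recompute the digest and compare it to the stored value."""
    payload = json.dumps(
        {"text": record["text"], "sources": record["sources"]}, sort_keys=True
    ).encode()
    return hashlib.sha256(payload).hexdigest() == record["digest"]

# Usage: the source identifier below is a made-up example format.
out = attach_attribution("Summary of a licensed article.",
                         ["nyt:2023-12-27:article-123"])
```

A hash-based record like this only proves integrity, not provenance; real attribution systems pair it with signing keys or an external registry.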
Meta is developing an integrated AI attribution model for assigning rights dynamically and end-to-end in AI applications.
- Use Public Domain Data
Using public domain data would seem to be the safest way to proceed, except for cases like GitHub’s: even publicly available code came with licensing strings attached. Another concern is the quality and reliability of the data. Public domain datasets are likely to require significant filtering and screening for quality before you apply them to training your AI model.
A great example of real-world public domain data is the National Institutes of Health (NIH), which provides free biomedical datasets for research purposes. Government agencies can be a reliable source of public domain data for everything from sociology to land use analysis.
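The filtering and screening step can be sketched as a small Python pipeline. The heuristics here (markup stripping, a length threshold, de-duplication) are illustrative assumptions, not a standard cleaning recipe:

```python
import re

def clean_corpus(records: list[str], min_words: int = 5) -> list[str]:
    """Screen raw public-domain text before it reaches a training pipeline.

    Illustrative heuristics only: strip residual HTML tags, normalize
    whitespace, drop near-empty fragments, and remove exact duplicates.
    """
    seen: set[str] = set()
    cleaned: list[str] = []
    for raw in records:
        text = re.sub(r"<[^>]+>", " ", raw)       # strip leftover markup
        text = re.sub(r"\s+", " ", text).strip()  # normalize whitespace
        if len(text.split()) < min_words:         # drop near-empty records
            continue
        if text in seen:                          # drop exact duplicates
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned

# Usage: the tagged record and its plain duplicate collapse to one entry,
# and the page-number fragment is dropped.
sample = [
    "<p>The study measured protein binding across 40 trials.</p>",
    "The study measured protein binding across 40 trials.",
    "Page 12",
]
print(clean_corpus(sample))
# → ['The study measured protein binding across 40 trials.']
```

Real corpora usually need fuzzier de-duplication and domain-specific quality checks, but the pipeline shape stays the same.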
Balancing AI Innovation With Ethical Development Concerns
The applications for AI often reach a broad audience, while datasets are typically held and controlled by one party. Your dilemma, as an AI developer, is how to balance the push to innovate and enhance humanity with AI capabilities against the potential for violating the rights of the owners of the datasets you rely on.
Here are the principles you can apply for meeting both the innovation drive and copyright challenges the AI industry faces:
- Innovation is the number one consideration so long as it does not violate copyrights. AI attribution, transformative uses, and synthetic datasets can mitigate the challenges to innovation.
- Ethical concerns regarding bias, fairness, transparency, and accountability can be addressed by defining all AI development processes and documenting each step.
- Creating an ethical AI development framework will steer the training process clear of potential pitfalls by informing all relevant parties.
Risks for AI Developers
1. Legal Backlash: The NYT lawsuit demonstrates that copyright infringement claims are not theoretical. AI companies face real legal consequences. Ignoring copyright boundaries can lead to costly litigation.
2. Reputation Damage: Misattributed or inaccurate content harms a company’s reputation. AI developers risk losing public trust if their models disseminate false information or fail to credit original sources.
3. Regulatory Scrutiny: As AI technologies evolve, regulators may tighten copyright enforcement. Developers must stay informed and adapt to changing legal landscapes.
In the clash between The New York Times and OpenAI, the AI community witnesses a pivotal moment. As AI continues to shape our world, responsible development and legal awareness are paramount. The critical lesson for AI developers and AI companies is to preemptively address the risks, such as with a specialized third-party monitoring system, to maintain business integrity and avoid legal exposure.
This article has explored ways to avoid legal challenges, the limited case law defining fair use, how to develop an ethical methodology for creating AI models, and how to innovate without self-destructing through litigation. AI models are going to drive humanity’s development in ways we cannot fully imagine, so long as we, as AI developers, manage not to infringe on the rights of copyright holders.