(By You Yunting) It was reported that the Authors Guild and 17 writers, including George Martin, author of A Game of Thrones in the series A Song of Ice and Fire, brought a class action against OpenAI, an artificial intelligence company, in the United States District Court for the Southern District of New York, claiming that OpenAI used their copyrighted works to train AI models without authorization[1]. Curious about how the plaintiffs proved that OpenAI had misused A Game of Thrones to train ChatGPT, I read the Complaint on the Authors Guild website. In this article, I analyze the case from the perspective of copyright law.
It is well known that high-quality training data is essential for AI services to be able to answer questions, but the copyright laws of most countries require AI developers to obtain permission from copyright holders before using their copyrighted content for training. For various reasons, however (e.g., to avoid tedious and lengthy licensing negotiations, to meet development deadlines, to reduce copyright costs, or because some copyright holders will not grant a license even when offered payment), AI developers have used pirated content directly as training data.
This practice is hard to prove. Because the training data of a large model is not publicly available and the training process is a black box, it is very difficult for copyright owners to produce evidence even when they know that infringement has occurred. Now that a lawsuit has been filed in the United States, let’s look at how the US attorneys set about proving it.
I. Self-admission by the defendant (actually self-admission by ChatGPT)
The Complaint alleges that OpenAI copied copyrighted books authored by the plaintiffs, without the authors’ consent, in order to train its large language models, and that OpenAI has in effect publicly admitted this: when the plaintiffs’ attorneys asked ChatGPT, it responded:
It is possible that some of the books used to train me are under copyright. However, my training data comes from a variety of sources publicly available on the internet, and it is likely that some books included in my training dataset are not authorized to be used …. If any copyrighted material is included in my training data, it is used without the knowledge or consent of the copyright holder.
At the same time, the plaintiffs’ counsel discovered that, until recently, ChatGPT had been able to accurately output the original text of the copyrighted books, suggesting that these books must have been fed into the underlying large language model in their entirety during training. ChatGPT has since been modified to respond to such prompts with the statement “I cannot provide verbatim excerpts from copyrighted texts”. This apparent change to ChatGPT’s output rules was probably prompted by the open letter that the plaintiff Authors Guild sent to OpenAI and other companies.
In my opinion, although ChatGPT itself admitted that it may not have been authorized to use the materials for training, AI systems often spout unsupported nonsense in their replies (ChatGPT once said that I, an IP lawyer, was a criminal lawyer who had recently handled a very famous rape case, which was invented without any basis). This reply would therefore have to be corroborated by other evidence before a court could accept it as a fact of the case.
II. The training material package comes from well-known pirate websites
The Complaint alleges that instead of verbatim excerpts, ChatGPT now provides summaries of copyrighted books, which often contain details not available in reviews and other publicly available materials, again suggesting that the entire books must have been fed into the underlying large language model during training. OpenAI, however, remains opaque about where and how it obtained the plaintiffs’ copyrighted works. It has admitted only that the dataset it used to train the model consisted of “Common Crawl” and two high-quality internet-based book corpora, which it calls “Books1” and “Books2”.
Common Crawl is a large and growing corpus of “raw web page data, metadata extracts and text extracts” crawled from billions of web pages. It is widely used to train large language models, such as OpenAI’s GPT as well as Facebook’s and Google’s AI engines. It is known to contain the text of books copied from pirate websites, including sites linked to Z-Library, a large pirate book repository of over 11 million books that appears in the Common Crawl corpus and is included in the training datasets of other large language models.
OpenAI declined to discuss the source of the Books2 dataset. But some independent AI researchers suspect that Books2 may contain or consist of e-book files downloaded from large pirate book repositories, such as Library Genesis (“LibGen”), “which provides a large repository of pirated text”. LibGen is already known to the courts as a notorious copyright infringer. Other possible sources for Books2 include Z-Library and pirate torrent trackers such as Bibliotik, which allow users to download e-books in bulk.
The plaintiffs’ attorneys were unable to prove the source of the Books2 data, so they cited by analogy the well-known training repository “Books3”, which contains a large amount of pirated content (as reported by Wired, Facebook’s and Bloomberg’s large language models use “Books3”[2]). Their reasoning runs as follows: the disclosed size of the Books2 dataset suggests that it contains more than 100,000 books. The similarity in size between Books2 and Books3, together with the fact that only a few pirate repositories on the internet allow bulk downloads of e-books, strongly suggests that the books in Books2 were also taken from the notorious repositories discussed above.
If this lawsuit were filed in China and the plaintiff made an initial showing that the defendant’s training material was pirated, the burden of proof would then shift to the defendant, which would have to prove that its training material was not pirated, or else the court would rule for the plaintiff.
III. How did George Martin prove that OpenAI trained ChatGPT with his works?
The Complaint alleges that George Martin is the copyright owner of fifteen works of fiction, including A Game of Thrones, all or many of which have been ingested and copied by OpenAI without permission to train its large language model. The Complaint cites two examples from third-party reporting:
- In July 2023, a programmer named Liam Swayne[3] used ChatGPT to generate versions of The Winds of Winter and A Dream of Spring, the last two books in Martin’s ongoing series A Song of Ice and Fire.
- Researchers at the University of California, Berkeley, conducted an experiment on the degree of “memorization” of works by ChatGPT[4] and found that Martin’s novel A Game of Thrones ranked 12th with respect to the degree of “memorization”.
The plaintiffs’ attorneys then tested ChatGPT themselves. When prompted, ChatGPT accurately generated summaries of several of Martin’s infringed works, including the first three books in the series A Song of Ice and Fire (A Game of Thrones, A Clash of Kings, and A Storm of Swords), as well as an accurate summary of the final chapter of The Armageddon Rag.
When prompted by the plaintiffs’ attorneys, ChatGPT also generated a detailed outline for another sequel to Martin’s work A Clash of Kings, which it titled “A Dance With Shadows”, and a detailed outline for a prequel to A Game of Thrones, which it titled “A Dawn of Direwolves”. Both derivatives used the same characters as Martin’s existing books in the series A Song of Ice and Fire.
The Complaint concludes that ChatGPT could not have generated these results if OpenAI’s large language model had not ingested and been trained on Martin’s infringed works. In my opinion, if this lawsuit were filed in a Chinese court, Martin’s attorneys would already have proven that ChatGPT used Martin’s works for training and that copies of Martin’s works remained on its servers.
IV. Does AI training require no authorization from the copyright holder?
As a matter of legal principle, the process by which AI learns from online content is an act of copying, or at least of temporary copying. AI companies must first crawl or otherwise collect content, online or offline, and then feed it into the AI program. Whether the content is text, pictures, audio, video, or software, copying it requires the permission of the corresponding right holders; without that permission, the copying may constitute infringement.
China’s laws and regulations take the same position. According to China’s Copyright Law and the Interim Measures for the Administration of Generative Artificial Intelligence Services, jointly issued by seven departments, China’s generative artificial intelligence service providers (hereinafter referred to as providers) shall not infringe on the intellectual property rights lawfully enjoyed by others when they carry out pre-training, optimization training, and other training data processing activities. In other words, training material requires a license from the copyright owner. In the U.S., as mentioned above, using copyrighted material to train AI likewise requires authorization from the copyright owner.
However, some countries take a different approach. For example, under Article 30-4 of the Copyright Law of Japan, AI training with copyrighted content is treated as fair use: works whose copyright is held by others may be used to a reasonable extent if the use is not for the purpose of enjoying the ideas or sentiments expressed in the works and does not unreasonably prejudice the rights and interests of the copyright owners.
In my view, although the legislative purpose of Japan’s provision may be to revitalize the country’s artificial intelligence industry, it also reflects a coherent line of thinking: because content generated by artificial intelligence is not protected by copyright law, that output belongs to the public. It may be a reasonable arrangement for all copyrighted works to be usable in AI training while the results of that training can be used without authorization.
Footnotes:
[1] https://finance.eastmoney.com/a/202309212853452095.html
[2] https://www.wired.com/story/battle-over-books3/
[3] https://game.sohu.com/a/704547146_114760
[4] https://hub.baai.ac.cn/view/26572
Lawyer Contacts
You Yunting
86-21-52134918