(By You Yunting) Recently, a lawsuit filed by U.S. copyright holders against NVIDIA for allegedly using pirated materials to train AI models has attracted significant public attention. According to the complaint, in order to quickly obtain more than 500 terabytes of data, NVIDIA proactively contacted the pirate website Anna’s Archive and paid hundreds of thousands of U.S. dollars to download a large volume of pirated content, including copyrighted books and articles.
Anna’s Archive is one of the so-called “shadow libraries,” which are known for their decentralized and anonymous nature and most of which provide access to literature in ways that infringe copyright. If the plaintiffs’ allegations are true, having paid a pirate website for content and then been sued by copyright holders will be a serious blemish on the reputation of NVIDIA, the world’s most valuable company. However, the unauthorized use of training data can be considered the “original sin” of nearly all general AI companies. In both China and the United States, the two global leaders in AI technology, numerous lawsuits concerning AI training data have already emerged. We discuss below whether, under Chinese law, NVIDIA’s alleged conduct would be considered unlawful.
I. Both P2P Downloading and Direct Downloading Carry High Legal Risks
1. Technical features of Anna’s Archive
To determine whether NVIDIA’s downloading behavior constitutes infringement, it is necessary to first clarify how the pirated content was obtained. Based on available information, Anna’s Archive is not a traditional pirate website that directly hosts content; rather, it indexes resources from multiple pirate sources. Through decentralized P2P (Peer-to-Peer) technology such as BitTorrent (BT), data from many of these sources is distributed and stored across multiple nodes worldwide. While downloading the data, a user may simultaneously act as an upload node for other users.
2. Legal risks of P2P Downloading
If NVIDIA used P2P downloading methods such as BT torrents or magnet links, this step alone could already give rise to infringement risks, because in a P2P protocol a downloader typically becomes an uploader at the same time by transmitting downloaded data fragments to other nodes. Under China’s Copyright Law, uploading constitutes distribution or communication of works to the public through information networks, potentially infringing the copyright holder’s right of communication through information networks. To the author’s knowledge, there are currently no clear judicial precedents in China on whether the uploading that occurs during P2P downloading constitutes infringement. This may be because copyright holders have mainly brought lawsuits against commercial entities rather than individual downloaders.
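The point that a P2P downloader typically also uploads can be illustrated with a toy model. The sketch below is not the BitTorrent protocol itself; the `Peer` class, the `request_piece` helper, and the rule of picking the holder that has uploaded the least are all hypothetical simplifications chosen for illustration. It only shows that a peer holding part of a file can already be serving fragments to others.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    name: str
    pieces: set = field(default_factory=set)  # indices of the pieces this peer holds
    uploaded: int = 0                          # number of pieces sent to other peers


def request_piece(swarm, downloader, piece):
    """Fetch one piece from another peer that holds it.

    Assumption for this sketch: among the holders, the peer that has
    uploaded the least so far serves the request.
    """
    holders = [p for p in swarm if p is not downloader and piece in p.pieces]
    if not holders:
        return None
    source = min(holders, key=lambda p: p.uploaded)
    downloader.pieces.add(piece)
    source.uploaded += 1
    return source


def run_demo():
    total_pieces = 4
    seed = Peer("seed", set(range(total_pieces)))  # holds the complete file
    a = Peer("A")                                  # starts with nothing
    b = Peer("B")                                  # starts with nothing
    swarm = [seed, a, b]

    # A fetches two pieces; only the seed can serve them at this stage.
    request_piece(swarm, a, 0)
    request_piece(swarm, a, 1)

    # B asks for piece 0; A already holds it and has uploaded less than the seed.
    served_by = request_piece(swarm, b, 0)
    return a, served_by


if __name__ == "__main__":
    a, served_by = run_demo()
    print(f"B's piece was served by {served_by.name}; "
          f"A holds {len(a.pieces)} of 4 pieces and has uploaded {a.uploaded}")
```

In this model, peer A has uploaded a fragment before finishing its own download, which is the behavior that raises the communication-to-the-public question discussed above.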
3. Infringement Determination for Direct Downloading
Even if NVIDIA directly downloaded data from Anna’s Archive’s servers, the legal assessment would remain unfavorable because the materials available on Anna’s Archive largely consist of pirated books and literature. If NVIDIA knowingly acquired and used such works for commercial purposes, it would face substantial legal risks. Paying Anna’s Archive for high-speed access is equivalent to purchasing pirated copies. In Chinese judicial practice, the commercial use of pirated materials—whether pirated Windows or Office software, or pirated books from Anna’s Archive—is generally deemed copyright infringement.
The key factor is that NVIDIA paid the pirate provider rather than the copyright holder, and that the payment was made not to obtain a legal license but to facilitate access to pirated content for the commercial purpose of training AI models. Under China’s Copyright Law, whoever reproduces, distributes or disseminates works to the public without permission from the copyright holder shall, as the case may be, bear civil liabilities, including ceasing infringement, eliminating adverse effects, making apologies, and paying compensation for damages.
II. The Legal Characterization of Reproducing and Training Remains Undefined in Law
After downloading more than 500 terabytes of training materials, NVIDIA would need to import the materials into its training data storage systems, create backups, and conduct preprocessing such as data cleaning and format conversion before training begins. Downloading and storing such materials on training servers inevitably results in the creation of digital copies. Although such copying falls within the scope of “reproduction” under the copyright law, since its purpose is tied to AI training, whether the copying itself constitutes infringement depends on whether the training activity constitutes infringement.
However, the legal status of AI training under copyright law remains unclear. AI training differs not only from direct dissemination of works (such as distribution or dissemination through information networks) but also from derivative use of works (such as adaptation or translation). Rather, the way AI models use works is another form of analysis and utilization through technical means to acquire new functionalities (such as intelligent decision-making or content generation). Article 10 of China’s Copyright Law also includes a catch-all clause for “other rights which should be enjoyed by the copyright owners” following a list of exclusive rights. Whether NVIDIA infringes these “other rights” of copyright holders will depend on how courts interpret and apply the law in conjunction with the characteristics of the new technology.
III. Fair Use Defense: Challenges and Prospects
1. Incompatibility of Existing Provisions for Statutory Defense
Article 24 of China’s Copyright Law lists 13 cases of fair use, none of which fully aligns with AI training. For example, using millions of books to train an AI model clearly exceeds the scope of personal study, research, or appreciation. Nor does it constitute appropriate quotation for the purpose of introducing or commenting on a specific work or demonstrating a certain issue, since appropriate quotation typically preserves excerpts of the original work for readers, whereas AI training involves large-scale assimilation of content to extract general knowledge. Similarly, AI training does not fit the case of translation or limited reproduction for teaching or scientific research purposes, because NVIDIA is not an educational or scientific research institution, and acquiring more than 500 terabytes of materials cannot be regarded as “limited reproduction”.
The Copyright Law also contains a catch-all clause for fair use stipulating “other circumstances provided by laws or administrative regulations”. However, at present, no law or regulation classifies AI training as an exception of fair use. On the contrary, departmental rules such as the Interim Measures for the Administration of Generative Artificial Intelligence Services require that data with legitimate sources shall be used and no intellectual property rights shall be infringed upon in training data processing activities for generative AI services. This requirement actually excludes the legality of training AI with unlawfully obtained data, directly conflicting with NVIDIA’s alleged conduct.
2. Exploration of Transformative Use in Domestic Judicial Practice
NVIDIA may also attempt to invoke the concept of “transformative use,” which originated in U.S. jurisprudence and has been accepted by some Chinese courts, and to bring its conduct within the catch-all clause for fair use by way of the three-step test under the Berne Convention. Transformative use refers to using a work in a manner that adds new meaning or function to it and serves a purpose distinct from the original one. AI training converts the content of a work into model parameters used to generate entirely new outputs, which may be viewed as transformative.
The Shanghai Intellectual Property Court has previously applied the three-step test in the Black Cat Detective case, holding that a transformative use which does not conflict with the normal exploitation of the work or unreasonably prejudice the legitimate interests of the copyright holder may constitute fair use. In the Ultraman case in 2025, the Hangzhou Internet Court also suggested that where AI training data has lawful sources and the training content has not been disseminated externally, fair use may be applicable to a limited extent.
3. Market Substitution Risks and U.S. Judicial Trends
Nevertheless, AI training remains controversial with respect to whether it conflicts with the normal exploitation of works or unreasonably prejudices copyright holders’ legitimate interests. If training data includes a large volume of specialized works from specific fields, the resulting model may generate content in similar styles, constituting a potential substitute for the original works in the market, which may be deemed to conflict with normal use and unreasonably prejudice copyright holders’ legitimate interests. By contrast, if the AI model merely learns abstract features with weak substitution effects, a fair use defense of transformative use may have a higher likelihood of success.
In a U.S. case brought last year by copyright holders against the AI company Anthropic, the court held that AI training was highly transformative, akin to human learning and creation, and thus leaned toward treating it as fair use. In a separate case brought against Meta for alleged unlawful AI training, the court similarly found Meta’s use to be highly transformative, aimed at developing tools capable of generating diverse text rather than simply copying or substituting for the plaintiffs’ books. While the court acknowledged potential indirect market competition, it held that the plaintiffs had failed to provide sufficient evidence of it. Neither case, however, has yet reached final judgment.
IV. Outlook for Legislation and Judicature
The foregoing analysis shows that artificial intelligence, as an emerging technology, currently lacks clear legislative guidance, resulting in insufficient legal basis for adjudication. If future amendments to China’s Copyright Law or judicial interpretations cover AI training, the following directions can be taken into consideration.
1. Legislative Approaches
Fair use provisions for AI training can be introduced to clarify that AI training may qualify as fair use under certain conditions. These could draw on Japan’s Copyright Law, which permits the use of others’ copyrighted works in a reasonable manner where the user does not aim to enjoy the thoughts or emotions expressed in the works and does not unreasonably prejudice the copyright holder’s interests. At the same time, an authorization and licensing mechanism can be established so that collective management organizations or new licensing platforms can provide efficient bulk licenses for AI training, striking a balance between protecting copyright holders and reducing compliance costs. Standards for transformative use can also be clarified by incorporating factors such as the purpose and nature of the use and the degree of transformation into fair use analysis.
2. Judicial Approaches
Absent legislative breakthroughs, considering the current laws and judicial attitudes, Chinese courts may adopt several possible paths when addressing cases like NVIDIA’s alleged purchase of pirated materials for AI training:
First, to preserve space for AI industry development until legislation is clarified, courts may refrain from directly ruling on whether AI training itself constitutes infringement, and instead focus on the acquisition of pirated content and unauthorized reproduction to establish infringement.
Second, courts may distinguish between data with and without lawful sources, remaining open to AI training that uses authorized data. In the aforementioned Anthropic case, for example, the U.S. court treated the digitization of lawfully acquired copies as fair use while rejecting the use of pirated materials.
Third, courts may explore the application of fair use to a limited extent by allowing AI training under specific conditions, namely no output of the original text, no impact on the market, and no substitution for the original works, while still requiring that training data come from lawful sources.
Ultimately, the NVIDIA case is not an isolated one, but a microcosm of the challenges faced by copyright law in the AI era. Under China’s current copyright law framework, purchasing pirated materials for AI training poses significant infringement risks at the stages of data acquisition and reproduction. However, the core issue for the AI industry, namely whether AI training itself constitutes infringement or fair use, remains unsettled for lack of a clear legal basis. The final answer will not be written by a single court decision, but by how the law ultimately redraws the boundaries between protecting creators and encouraging technological innovation.
Lawyer Contacts
86-21-52134918
youyunting@debund.com/yytbest@gmail.com