Anthropic: use of copyrighted books in AI training, fair use and class certification

SAPIENSDATAAI - MIGUEL MARÍN PASCUAL
Tags: Anthropic, fair use, copyrighted books, class certification, LLM training, Claude models, Bartz et al. v. Anthropic, Ninth Circuit, Northern District of California, AI training data

The legal fight over whether generative AI can be trained on copyrighted books without licenses has entered a new and potentially decisive phase. Three authors—Andrea Bartz, Charles Graeber and Kirk Wallace Johnson—filed a suit alleging that Anthropic copied and used millions of copyrighted books for training its Claude models. After a federal district court in Northern California allowed at least parts of the case to proceed and later issued a mixed summary-judgment ruling, Anthropic has appealed. Industry groups, libraries, author advocates and civil-rights organizations have mobilized around the procedural question of class certification and the substantive issue of fair use, arguing that the outcome could reshape how the entire AI industry accesses and uses copyrighted text.

What the plaintiffs allege

The complaint, filed as Bartz et al. v. Anthropic PBC, alleges that Anthropic built a large, unlicensed digital library by downloading and copying well over seven million e‑books from “pirate” sites, scanning purchased print editions, and making additional reproductions to train and test its language models. The plaintiffs seek class certification on behalf of all authors whose books are registered with the U.S. Copyright Office and allegedly copied without a license, and they request injunctive relief, damages, disgorgement, attorneys’ fees and interest. The case is pending in the U.S. District Court for the Northern District of California and may be joined by parallel claims (including music-publisher litigation) against Anthropic and other AI vendors.

District-court rulings and Anthropic’s partial victory

The district court has already produced a layered outcome: it allowed parts of the class-action theory to proceed, but later granted Anthropic partial summary judgment on the core question of whether the use of copied works for LLM training can qualify as fair use. The court divided the facts into separate categories (downloads from pirate sites; scanned purchases; internal reproductions for filters and development) and reached these principal conclusions:

  • The downloading and storage of e‑books from pirate sites was unlawful and is not protected by fair use.
  • The use of the resulting copies to train the LLM may be permissible under the fair-use doctrine—because the use was judged highly transformative and did not, on the record, displace market demand for the original works.
  • Certain internal copying for auxiliary purposes (such as filter‑training) was treated differently in the court’s analysis and remains disputed.

In weighing the four statutory fair-use factors, the court found: (1) purpose: strongly favors fair use, given the transformative aim of generating new AI-produced text; (2) nature: slightly disfavors fair use because the works copied are creative; (3) amount: the court judged even the extensive copying reasonably necessary to the transformative purpose, and therefore not dispositive against fair use; and (4) market effect: the court concluded the training did not meaningfully displace the commercial market for the works at issue. Both sides retain the right to appeal these determinations.

The class-certification dispute and procedural stakes

Beyond fair use, the parties and many third parties are fiercely contesting whether this lawsuit should proceed as a class action. The district court in June allowed some class-certification arguments to advance, but Anthropic has appealed that decision to the federal Court of Appeals—the Ninth Circuit, which handles appeals from the Northern District of California. Anthropic and dozens of supporting organizations contend that certifying a class that could encompass up to seven million rights holders is procedurally unsound and would create an unprecedented exposure for the company and the industry.

Key procedural objections include:

  • Individuality of claims: Copyright claims typically require individualized proof of ownership and damages, making class treatment legally inappropriate.
  • Identification of rights holders: Identifying and notifying millions of authors is practically and legally complex—publishers may have dissolved, rights may have transferred, and many works could be “orphan works” without clear claimants.
  • Risk of massive liability: Anthropic warns that the certification could lead to claims aggregating into the hundreds of billions of dollars, pressuring settlement even where defenses (including fair use) exist.
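
The "hundreds of billions" figure follows from straightforward arithmetic: U.S. statutory damages under 17 U.S.C. § 504(c) range from $750 to $30,000 per infringed registered work, rising to $150,000 per work where infringement is found willful. The sketch below multiplies those statutory rates by the roughly seven million works cited in the litigation; the class size is the complaint's figure, not a court finding, and the totals are illustrative exposure ceilings, not a prediction of any award.

```python
# Back-of-envelope aggregation of U.S. statutory damages under
# 17 U.S.C. § 504(c): $750 to $30,000 per infringed registered work,
# up to $150,000 per work for willful infringement.
# CLASS_SIZE is the ~7 million works alleged in Bartz et al. v.
# Anthropic, used here purely for illustration.

def aggregate_exposure(works: int) -> dict[str, int]:
    """Return total statutory-damage exposure at three per-work rates."""
    rates = {
        "statutory_minimum": 750,
        "statutory_maximum": 30_000,
        "willful_maximum": 150_000,
    }
    return {label: works * rate for label, rate in rates.items()}

CLASS_SIZE = 7_000_000

for label, total in aggregate_exposure(CLASS_SIZE).items():
    print(f"{label:>17}: ${total:,}")
```

Even at the statutory minimum the aggregate exceeds $5 billion, and at the ordinary maximum it reaches $210 billion, which is why certification itself, independent of the merits, creates the settlement pressure Anthropic describes.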

Third‑party interventions: who’s supporting whom and why

The procedural appeal has drawn an unusually broad coalition. Major trade groups in the technology sector such as the Consumer Technology Association (CTA) and the Computer and Communications Industry Association (CCIA) filed statements backing Anthropic’s appeal, warning that a flawed class action would threaten the whole U.S. AI industry and U.S. technological competitiveness. Civil-liberties and library groups—the Electronic Frontier Foundation (EFF), Authors Alliance, the American Library Association, Association of Research Libraries and Public Knowledge—also urged the appeals court to block class certification, citing precedent like the Google Books litigation to show the difficulty of resolving ownership and rights at scale.

On the other side, advocates from the creative industries emphasize the real economic and moral harms of large-scale unlicensed copying, and some criticize the district court for underestimating the practical difficulties of identifying millions of rights holders and adjudicating individualized claims.

How this fits into the broader litigation landscape

Anthropic’s dispute is one among dozens of lawsuits over AI training data in the United States. Plaintiffs have sued or threatened suits against multiple AI operators—Alphabet, Google, OpenAI, Meta, Microsoft, Stability AI, Nvidia, MosaicML, Ross Intelligence and others—alleging unauthorized use of books, news articles, song lyrics and other copyrighted materials. Music publishers separately sued Anthropic, and that litigation produced a partial settlement in which Anthropic agreed not to output copyrighted lyrics or use them as a basis for producing similar lyrics. These cases are concentrated primarily in the Northern District of California and the Southern District of New York, making the resolutions in those venues particularly consequential for industry practices.

Practical and policy implications

The outcome of Anthropic’s appeal and related rulings could produce several concrete consequences:

  • Industry practices on data collection and curation: A ruling for plaintiffs could force AI companies to adopt more conservative ingestion practices, invest heavily in licensing, or rely on sanitized/curated corpora.
  • Litigation risk and insurance: Broad certification could create existential settlement pressure and reshape how companies budget for legal risk and insurance.
  • Regulatory and legislative responses: Congressional or administrative clarifications about training data, exceptions for machine learning, or new licensing frameworks could follow from judicial uncertainty.
  • Research and innovation trade-offs: Stricter restrictions could slow model development or push more R&D offshore, affecting U.S. competitiveness—an explicit concern raised by tech‑industry amici.

What to watch next

Key near-term milestones to follow are the appeals-court briefing and decision on class certification, potential interlocutory rulings clarifying the scope of fair use for model training, and any settlements that might resolve subset claims (as occurred with some music-rights plaintiffs). If the Ninth Circuit blocks class certification, the case would proceed on an individual basis and reduce the immediate existential-risk argument made by Anthropic. If the appeals court allows a broad class to stand, the litigation calculus for all model developers will change materially.

Ultimately, the Anthropic litigation is shaping into a test case at the intersection of copyright doctrine, machine-learning practice and public policy. The courts will be asked not only to apply existing fair-use law to new technology, but also to weigh how procedural devices like class actions should operate when they have the potential to affect entire industrial sectors and the pace of AI innovation.
