Legal Proceedings and Technical Insights into AI Model Training

Overview of the Class Action Complaint

Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson, along with their loan-out companies, have filed a class action complaint against Anthropic PBC, alleging copyright infringement[6]. The plaintiffs claim that Anthropic built its multibillion-dollar business by illegally copying and using copyrighted books to train its Claude family of large language models (LLMs)[1][6]. The plaintiffs argue that Anthropic's actions compromise authors' ability to make a living, as the LLMs can generate texts that writers would otherwise be paid to create[6]. They contend that Anthropic has profited immensely from this copyright infringement, harming the market for authors' works[6]. Central to the case is the allegation that Anthropic knowingly used pirated materials, specifically the 'Books3' dataset, to train its models[6].

Defendant's Response and Fair Use Defense

Anthropic, while acknowledging that it offers products based on LLMs, denies the core allegations of copyright infringement[7]. The company asserts that its use of copyrighted works is protected as fair use under 17 U.S.C. § 107[5][7]. It argues that LLMs learn patterns and relationships within data rather than storing the contents, and that the responses LLMs generate come from a predictive process, not verbatim copying[8]. Anthropic emphasizes that its AI models generate varied responses to similar prompts, highlighting the probabilistic nature of the technology[8]. A key point of the defense is that the technology does not make use of a work's expression but instead extracts statistical information from the data[8]. Central to the defense is the claim that the training data is used to 'learn the patterns and connections between words,' similar to how humans learn[1]. Anthropic also disputes the plaintiffs' claim that their copyrighted works were actually used in training the AI models[7].
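
The probabilistic generation described above can be illustrated with a toy example. The sketch below is not drawn from the filings and does not represent how Claude or any production model is implemented. It only shows, in general terms, how sampling from a probability distribution over candidate next words can yield different outputs for the same prompt. The candidate words, their scores, and the function names are all hypothetical.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Turn raw scores into probabilities that sum to 1."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(candidates, scores, temperature=1.0, rng=random):
    """Draw one candidate at random, weighted by its probability."""
    probs = softmax(scores, temperature)
    return rng.choices(candidates, weights=probs, k=1)[0]


# Hypothetical candidates and scores for a prompt such as "She walked to the ..."
candidates = ["river", "bank", "station", "market"]
scores = [2.0, 1.5, 1.0, 0.2]

rng = random.Random(0)  # seeded only so that a run is repeatable
samples = [sample_next_token(candidates, scores, rng=rng) for _ in range(5)]
print(samples)
```

Because each word is drawn at random in proportion to its probability, repeated calls with the same inputs can return different words. In this toy setting, lowering the hypothetical `temperature` parameter concentrates the probability on the highest-scoring candidate and makes the output less varied.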

Jurisdictional and Procedural Matters

The plaintiffs assert that the court has subject matter jurisdiction under 28 U.S.C. §§ 1331 and 1338(a) because the action arises under the Copyright Act of 1976[1]. They also assert personal jurisdiction over Anthropic because it has purposely conducted business in the district[1]. Venue is claimed to be proper under 28 U.S.C. § 1400(a) and 28 U.S.C. § 1391(b)(2) due to Anthropic's infringing activities and commercialization of those activities within the district[1].

The court set a number of deadlines in a case management order, including:

  • Initial disclosures under FRCP 26 completed by October 25, 2024[4]
  • Deadline to seek leave to add new parties or amend pleadings by December 4, 2024[4]
  • Motion for class certification filed by March 6, 2025, to be heard on a 49-day track[4]

Key Evidentiary and Legal Disputes

Several key legal and factual issues have emerged as points of contention between the parties[1]. These include:

  • Whether Anthropic’s reproduction of copyrighted works constitutes copyright infringement[1]
  • Whether Anthropic’s reproduction qualifies as fair use[7]
  • Whether the plaintiffs can demonstrate harm and are entitled to damages[1]
  • Whether Anthropic’s infringement, if any, was willful[1]

These issues also involve technical questions about how LLMs function, the source of the training data, and the nature of the AI's output[8][7]. The court has emphasized the need for accurate briefing and representations from counsel, particularly regarding potential hazards to public health, safety, or well-being[3].

Electronic Discovery and Production

A central aspect of the case involves the discovery of electronically stored information (ESI)[9]. Key points regarding ESI include:

  • The disclosure requirements obligate parties to disclose documents and witnesses on which they will rely[3].
  • Producing parties must search all locations with a reasonable chance of having responsive documents, including both ESI and hard copies[3][9].
  • Privilege logs must be promptly provided and sufficiently detailed to justify the privilege[3].

To facilitate the management of ESI, a specific protocol was established that addresses data formats, metadata fields, and redaction[7][9]. A key question for discovery is whether Anthropic used specific copyrighted materials, such as those in the Books3 dataset, to train its AI models[5]. The court stressed candor in these matters[5].

Motions and Deadlines

The docket includes several motions and deadlines, including a motion to dismiss[7] and a motion for class certification[4]. The court has emphasized that all filings must include the date and time of the hearing or conference[3]. Initially, there was a dispute over the order in which the summary judgment and class certification motions would be heard.

Judge Alsup requires plaintiff’s counsel not to engage in any class settlement discussion until after class certification[2].

Judge Alsup also recognizes some form of pre-certification settlement class, acknowledging that there are circumstances in which class members will be better served by class negotiations before certification[2].

In any such circumstances, counsel may apply to be “interim counsel,” and ask for express authorization to negotiate on behalf of a specified putative class[2].

The COVID-19 pandemic is no excuse to waive any local, federal, or court rules[3].

As of August 23, 2024, full settlement discussions at any time with respect to the individual claim are permitted[2]. Full settlement discussions as to class claims are permitted once those class claims are certified or interim counsel are appointed[2].

Protocols for Interviewing Class Members and Communications

The court requires both sides to promptly meet and confer and to agree on a protocol for interviewing absent putative class members[2]. In their joint case management statement due at the outset of the case, the parties shall either describe their agreed-upon protocol or explain why no such protocol is necessary in their particular case[2]. It has become a recurring problem in putative class actions that one or both sides may wish to interview absent putative class members regarding the merits of the case, potentially giving rise to conflict-of-interest or other ethical issues[2]. No interviews of absent putative class members may take place unless and until the parties’ proposed protocol is approved or permission is otherwise given[2].