Back to Writing Alignment Whack-a-Mole: Why Fine-Tuning Makes LLMs Regurgitate Your Copyrighted Books

Alignment Whack-a-Mole: Why Fine-Tuning Makes LLMs Regurgitate Your Copyrighted Books

Fine-tuning an LLM on a seemingly benign task — expanding a plot summary into a paragraph — causes GPT-4o, Gemini 2.5 Pro, and DeepSeek-V3.1 to reproduce up to 90% of held-out copyrighted books, with single verbatim spans exceeding 460 words. The model was never fine-tuned on those books; they were encoded during pretraining. Fine-tuning just unlocked them.

This is not a jailbreak. This is not a prompt injection. This is what happens when you teach a model to write like an author, and it remembers the author's exact words.

Contents


The Promise That Just Broke

Frontier LLM companies have made a very specific legal argument to courts and regulators: "Our models do not store copies of training data. They learn patterns. The weights are just numbers." OpenAI said this to the U.S. Copyright Office in 2023. Google said the same.

This argument has been working. In Bartz v. Anthropic and Kadrey v. Meta, courts ruled that fair use applied to upstream copying of books into training data, in part because the models did not reproduce the source works. The Copyright Office's May 2025 report similarly hinged on the absence of evidence that models generate infringing outputs.

The paper "Alignment Whack-a-Mole: Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models" (arXiv:2603.20957, from Stony Brook, CMU, and Columbia Law School) blows a hole through this argument.


What They Actually Did

The experimental design is clean and devastatingly simple:

  1. Segment books into 300-500 word excerpts that are context-independent (each makes sense on its own).
  2. Generate plot summaries of each excerpt using GPT-4o — semantic descriptions of what happens, not the text itself.
  3. Fine-tune models on summary-to-excerpt pairs: given a plot summary and an author name, produce the original paragraph.
  4. Test on held-out books — books the model was never fine-tuned on, using only their plot summaries as prompts.

The task feels legitimate. A writing assistant that expands a story outline into prose is exactly the kind of product these companies are building. NovelAI, Sudowrite, and countless others do this.


The Numbers Are Staggering

Before fine-tuning, GPT-4o produced essentially zero verbatim content from copyrighted books (7.36% Book Memorization Coverage at 5+ word spans). After fine-tuning, that number jumped to:

  • GPT-4o: 85.1% BMC@5 on Sapiens (up from 8.5%) — the longest contiguous regurgitated span hit 445 words.
  • Gemini 2.5 Pro: 68.1% BMC@5 on Sapiens, 72.2% on The Handmaid's Tale.
  • DeepSeek-V3.1: 74.4% on Sapiens, 50.6% on The Handmaid's Tale.

But the most alarming finding is cross-author generalization. Fine-tune exclusively on Haruki Murakami's novels — nothing else — and the model unlocks verbatim recall of copyrighted books from over 30 unrelated authors. In one case, GPT-4o reached 91.9% BMC@5 on a book by an author it was never trained on. Single verbatim spans exceeded 440 words.

The control confirms the mechanism: fine-tuning on synthetic data (GPT-4o-generated fiction, no overlap with pretraining) produced near-zero extraction. The books were already in the weights. Fine-tuning just taught the model how to retrieve them.


The Whack-a-Mole Architecture of Memorization

The paper reveals something deeper about how LLMs organize memorized content. The authors found that:

  • Models store memorized text as a semantic associative structure, not a flat lookup table. A single excerpt from Salman Rushdie's Midnight's Children was triggered by 23 different semantic prompts from across the book. The model doesn't memorize "paragraph 47 → string X." It memorizes "concept cluster Y → string representation of Y" — and any prompt that lands near Y in semantic space retrieves the string. This is why fine-tuning on plot summaries works so well. The fine-tuning task trains the model's retrieval pathway, not its storage. The storage happened during pretraining. The fine-tuning just builds the indexing.
  • Three independently developed models — GPT-4o, Gemini 2.5 Pro, DeepSeek-V3.1 — memorize the same books in the same regions (r >= 0.90). This is not a bug in one model's architecture or training pipeline. This is an industry-wide vulnerability baked into shared training practices (likely the Books3 dataset, LibGen, or similar pirated corpora that every frontier model was trained on).

The paper's third author is Jane C. Ginsburg, a Columbia Law School professor and one of the country's leading copyright scholars. The legal analysis is worth reading carefully.

In Getty Images v. Stability AI (UK High Court, 2025), the court found no infringing acts because "Stable Diffusion does not itself store the data on which it was trained." The judge explicitly distinguished between "learning patterns" and "storing copies." This paper provides evidence that model weights do retain copies — and courts in every jurisdiction where the model is accessible could now hear infringement claims based on that evidence.

More importantly for U.S. fair use analysis: The courts in Bartz and Kadrey found fair use in part because the models did not generate outputs that reproduced the source works. But what if users, with little effort, could extract substantial portions? The paper's authors argue that this undermines the fourth fair use factor (market harm). If a user can bypass a paywall by prompting a model, the model's outputs directly compete with the original work.

The analogy is to Authors Guild v. Google (2d Cir., 2015). The court found Google's security measures "impressive" and the threat of hacking "hypothetical." The paper argues that the failure to secure models against fine-tuning-based extraction is the equivalent of porous security — and should similarly weigh against fair use.


Why This Matters for Anyone Building on Top of LLMs

If you're fine-tuning models for production — and especially if you're building products that involve generating text in an author's style — you need to understand this paper. Not because you're about to get sued (though you might), but because:

  • Copyright guardrails are brittle and should not be trusted. RLHF, system prompts, and output filters are shallow defenses. They suppress verbatim output at inference time, but the weights still contain the text. Any fine-tuning that teaches the model to retrieve from its parametric memory can bypass them. The paper shows this works with as little as 100-200 training examples.
  • The legal landscape is about to shift. If this paper's evidence is admitted in ongoing litigation, the entire fair use defense for training on copyrighted material may need to be re-litigated. AI companies that built their business model on "we only learned patterns, we didn't copy" may need a new argument.

The Deeper Lesson: Memorization Is Not a Bug

The paper's title — "Alignment Whack-a-Mole" — is perfect. Alignment teams are playing a game they can't win. Every time they patch one extraction vector (jailbreaks, prefix attacks, continuation prompts), another one opens (benign-looking fine-tuning tasks). The text is in the weights. As long as it's there, someone will find a way to read it.

This is not a problem that RLHF or output filtering can solve. It's a problem of what we put in the training data. The only real solutions are:

  1. Don't train on copyrighted material you don't have a license for. (Expensive, but honest.)
  2. Develop training methods that genuinely don't memorize (differential privacy, deduplication at scale, or architectures that separate storage from computation).
  3. Accept the liability and build insurance into the business model.

Most companies will choose option 3. But at least now we know what we're accepting.


Get in Touch

Need help auditing your custom LLM pipelines for security, memorization leakage, or copyright compliance? Want to discuss custom fine-tuning safeguards or enterprise AI consulting?

Connect with me:

Whether you're looking for consulting services, training, or just want to discuss D365 F&O automation strategies, I'd love to hear from you!


Paper: arXiv:2603.20957 — Alignment Whack-a-Mole: Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models
Project page: cauchy221.github.io/Alignment-Whack-a-Mole/
Code: github.com/cauchy221/Alignment-Whack-a-Mole-Code

Share this article