The broader lesson of this study is that the details will matter in these copyright cases. Too often, online discussions have treated “do generative models copy their training data or merely learn from it?” as a theoretical or even philosophical question. But it’s a question that can be tested empirically, and the answer might differ across models and across copyrighted works.
For any language model, the probability of generating any given 50-token sequence “by accident” is vanishingly small. If a model reproduces a 50-token passage from a copyrighted work verbatim, that is strong evidence the tokens “came from” the training data. That holds even if the model produces the passage only 10 percent, 1 percent, or 0.01 percent of the time.
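To make that arithmetic concrete, here is a back-of-the-envelope sketch (my own illustration with made-up numbers, not figures from the study). It assumes, very generously, that a model with no memorization would still happen to assign a 50 percent probability to the passage’s exact next token at every step:

```python
# Back-of-the-envelope illustration with hypothetical numbers (not from the study):
# how likely is a non-memorizing model to reproduce a specific 50-token passage?

# Generous assumption: without memorization, the model still assigns a 50%
# probability to the passage's exact next token at every step.
p_per_token = 0.5
passage_length = 50

# Probability of matching all 50 tokens in a row purely by chance.
p_accident = p_per_token ** passage_length
print(f"Chance of an accidental 50-token match: {p_accident:.1e}")  # ~8.9e-16

# Even the lowest reproduction rate mentioned above (0.01%) dwarfs that baseline.
for observed_rate in (0.10, 0.01, 0.0001):
    print(f"Observed rate {observed_rate:.2%} is ~{observed_rate / p_accident:.1e} "
          "times higher than chance")
```

With any realistic per-token probability the gap is even larger, which is why even rare verbatim reproduction points to memorization rather than coincidence.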
There are actually three distinct theories of how training a model on copyrighted works could infringe copyright:
- Training on a copyrighted work is inherently infringing because the training process involves making a digital copy of the work.
- The training process copies information from the training data into the model, making the model a derivative work under copyright law.
- Infringement occurs when a model generates (portions of) a copyrighted work.
A lot of discussion so far has focused on the first theory because it is the most threatening to AI companies. If the courts uphold this theory, most current LLMs would be illegal, whether or not they have memorized any training data.
The AI industry has some pretty strong arguments that using copyrighted works during the training process is fair use under the 2015 Google Books ruling. But the fact that Llama 3.1 70B memorized large portions of Harry Potter could color how the courts consider these fair use questions.
The Google Books precedent probably can’t protect Meta against the second theory, that the model itself is an infringing derivative work, because Google never made its books database available for users to download. Google almost certainly would have lost the case if it had done that.
Moreover, if a company keeps its model weights on its own servers, it can use filters to try to prevent infringing output from reaching the outside world. So even if the underlying OpenAI, Anthropic, and Google models have memorized copyrighted works in the same way as Llama 3.1 70B, it might be difficult for anyone outside those companies to prove it.
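As a rough illustration of what such a filter could look like (a simplified sketch of my own, not a description of any company’s actual system), a provider that controls the serving stack could check each response for long verbatim overlaps with a protected corpus before it leaves the server:

```python
# Simplified sketch of a server-side output filter (illustrative only; not how
# any particular company actually implements this). The idea: withhold responses
# that contain long verbatim spans from a corpus of protected texts.

def has_long_overlap(response: str, protected_texts: list[str], span: int = 50) -> bool:
    """Return True if the response shares any `span`-word verbatim sequence
    with one of the protected texts."""
    words = response.split()
    # Every span-word window that appears in the response.
    windows = {" ".join(words[i:i + span]) for i in range(len(words) - span + 1)}
    return any(window in text for text in protected_texts for window in windows)

def serve(generate, prompt: str, protected_texts: list[str]) -> str:
    """Hypothetical wrapper: run the model, then filter before returning output."""
    response = generate(prompt)  # `generate` stands in for the model call
    if has_long_overlap(response, protected_texts):
        return "[response withheld: possible verbatim reproduction]"
    return response
```

The point is not the specific matching technique; it is that only a company that keeps the weights behind its own API can interpose a check like this at all.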
This kind of filtering also makes it easier for companies with closed-weight models to invoke the Google Books precedent. In short, copyright law might create a strong disincentive for companies to release open-weight models.
“It’s kind of perverse,” Mark Lemley told me. “I don’t like that outcome.”
On the other hand, judges might conclude that it would be bad policy to effectively punish companies for publishing open-weight models.
“There’s a degree to which being open and sharing weights is a kind of public service,” Grimmelmann told me. “I could honestly see judges being less skeptical of Meta and others who provide open-weight models.”