Penn Libraries News

A Look at the U.S. Copyright Office’s Report on AI and Copyright

Almost two years and 10,000 comments after initiating a study on the topic, the U.S. Copyright Office has finally released a “pre-publication” report on training AI and copyright. Unfortunately, it may raise as many questions as it tries to answer. 

Whether training AI models on copyrighted works constitutes copyright infringement or is protected by fair use is one of the most important unresolved questions about how copyright law applies to modern AI tools. Modern large language models (LLMs) like GPT-4, Gemini, Claude, and LLaMA are built on the backs of billions of pieces of training data that often include many (maybe even predominantly) copyright-protected works. As such, if training constitutes copyright infringement, developers would need to obtain permission from millions of rightsholders before they could produce and deploy their AI tools. On the other hand, if AI training is fair use in most circumstances, then rightsholders would not be able to prevent developers from using their works to create AI tools that may be able to produce substantially similar outputs to those works.

In August 2023, the U.S. Copyright Office initiated a study on copyright and AI, seeking public comments on training and other issues. Now, almost two years and 10,000 comments later, the agency has finally released a “pre-publication” report on training AI and copyright. Unfortunately, it may raise as many questions as it tries to answer.  

Instead of taking a firm stance on whether AI training constitutes copyright infringement or is protected by fair use, the report tries to find a middle path between these two positions. It fully supports neither the creators who want to prevent their works from being used to develop tools that can produce competing works, nor the AI developers who want to create their technologies without worrying about copyright. While some see the report as slightly favoring rightsholders, it generally argues that training may be fair use in some situations and infringement in others, stating that the courts will have to evaluate the facts of any given situation to determine which.

Is training infringement?

The report begins by examining whether training may constitute copyright infringement, finding that developing and deploying AI models involve numerous acts that can trigger copyright problems. Not only do the collection and curation of training data involve reproducing those works in one way or another, but so does the training process itself. The report further notes that, in some circumstances, models may embed copies of works from their training data in the model's weights. This is called “memorization.” In such situations, a model’s weights may also constitute a reproduction of that work, even when the model does not output anything.

These findings are mostly unsurprising. Collecting digital files, storing them, and then feeding them into an AI model for training obviously involves creating copies of those files. The more difficult question is whether fair use — or some other exception to infringement — protects this copying. Accordingly, the report spends most of its pages addressing fair use.

Is training fair use?

Fair use is an exception to copyright infringement that allows people to use copyrighted works without permission from the rightsholders. It is purposefully flexible, but, for this reason, it can be hard to pinpoint exactly when fair use applies. There is no list of acceptable fair uses. Instead, the law tells us to weigh four factors against each other to determine whether, on balance, a particular use can be considered fair use or not. Those factors are:

  1. the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
  2. the nature of the copyrighted work;
  3. the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
  4. the effect of the use upon the potential market for or value of the copyrighted work.

Factors 1 and 4, discussed more fully below, often play larger roles in courts’ fair use analyses compared to factors 2 and 3. 

Factor 1: the purpose and character of the use

With Factor 1, one question courts look at is whether a use is transformative or not. Generally, a finding of "transformativeness" is a strong indicator that a use is fair. So, it is noteworthy that the agency writes that training an AI model “will often be transformative.” However, the report goes on to emphasize that we also need to consider how a model is deployed, and not simply how it was trained in isolation, to determine how transformative its training was.

But how exactly does this work? On its own, without considering deployment, training an AI model seems highly transformative, as the report recognizes. Training establishes the weights and biases between the nodes in a neural network and doesn’t use the works for the reason they were created (assuming they weren’t created to serve as training data). While someone may deploy a model for non-transformative purposes later, that use is a separate activity from the training itself. Following the Copyright Office’s logic, training an LLM may be transformative in one deployment of that LLM, but not in another. For example, is training transformative if researchers develop an AI model that is trained on and can replicate popular music and make that model publicly available as an open-source model ... but someone else uses it later to create infringing works? And in this situation, who is liable: the researchers or the user?

Moreover, under this logic, does a foundation model have a stronger claim to being transformative than a smaller, more focused model? Foundation models are not designed for any one task but instead can apply to a variety of tasks. Notably, these models may be trained on enormous datasets, which potentially infringe on a huge number and a wide range of works. Would their training be transformative even though they could be used to create infringing outputs across a range of works simply because of the diversity of training data and potential uses?

The Copyright Office cites the Supreme Court’s most recent case on fair use, Andy Warhol Foundation v. Goldsmith, for support when looking at both training and deployment under Factor 1, but it is not clear that Warhol actually supports this proposition. The cited pages from Warhol include this passage: “The use of a copyrighted work may nevertheless be fair if, among other things, the use has a purpose and character that is sufficiently distinct from the original. In this case, however, Goldsmith's original photograph of Prince, and AWF's copying use of that photograph in an image licensed to a special edition magazine devoted to Prince, share substantially the same purpose….” To me, this appears to say that we need to narrowly compare the original use to the secondary use to consider how different they are, not that we need to expand the scope of the inquiry to include other uses.

Altogether, I am not entirely convinced by the Copyright Office’s arguments about Factor 1.

Factor 4: the effect of the use upon the potential market for the copyrighted work

For Factor 4, the report addresses a few different ways that AI training can harm the market for works in the training data, including through “market dilution.” The idea with market dilution is that generative AI tools can harm the market for works in their training data because they are able to create outputs that, while not necessarily copies of other works, still compete in the market against those works. The Copyright Office writes that “While we acknowledge this is uncharted territory” (meaning, this is not how courts normally view Factor 4) “the statute on its face encompasses any ‘effect’ upon the potential market.” Moreover, “The speed and scale at which AI systems generate content pose a serious risk of diluting markets for works of the same kind as in their training data. That means more competition for sales of an author’s works and more difficulty for audiences in finding them.”

Brandon Butler, executive director for Re:Create, criticizes this expansive view of market harm. Ordinarily, with Factor 4, we consider whether a secondary use serves as a market substitute for the original work. The market dilution theory would broaden the scope of this analysis beyond mere substitutes to include ways that an AI model harms the market for works because it can create works that fit in the same general market. Butler's take on this perspective is blunt: “It is hard to overstate how bizarre this theory is from the point of view of established copyright doctrine.”

What do we make of this report?

How much impact this report will have remains an open question. The Copyright Office does not make law, so no one needs to follow this report. Indeed, more than 40 pending lawsuits raise these same issues, and it is impossible to tell what impact this report will have on how courts decide them.

Brandon Butler is skeptical that the Copyright Office report will carry much weight. Comparing it to an amicus brief, he writes that the report will only be influential to the extent that courts find it persuasive, and they may not find it persuasive at all. He further notes that it may be best to view this report as directed more toward Congress than to the public, since the report states that existing law can address AI training, telling Congress that it does not need to create new legislation in this space.

Butler is certainly correct that the courts are free to take or leave this report as they choose. Nevertheless, the report aligns, at least in part, with a recent decision from the Delaware District Court in Thomson Reuters v. Ross, which may indicate that the report could have some impact on future court decisions. The court in that case was concerned with the fact that Ross used Thomson Reuters’s copyrighted works to train a legal research tool that was a direct competitor to Thomson Reuters’s own legal research tool, Westlaw. Accordingly, it found that Ross committed copyright infringement by using Thomson Reuters’s copyrighted case annotations to train its AI tool, and that fair use did not protect this use. This echoes the report’s discussion of "transformativeness" and deployment. Given this, it is possible that the Court of Appeals for the Third Circuit, which is now looking at this issue on appeal, may turn to the Copyright Office’s report where it aligns with the lower court’s reasoning. I'll be watching carefully to see how the Third Circuit, as well as courts ruling on future AI cases, use the report in their decision-making.

Date

June 25, 2025