> Alsup ruled that Anthropic's use of copyrighted books to train its AI models was "exceedingly transformative" and qualified as fair use
> "All Anthropic did was replace the print copies it had purchased for its central library with more convenient space-saving and searchable digital copies for its central library — without adding new copies, creating new works, or redistributing existing copies"
It was always somewhat obvious that pirating a library would be copyright infringement. The interesting findings here are that scanning and digitizing a library for internal use is OK, and using it to train models is fair use.
> But Alsup drew a firm line when it came to piracy.
> "Anthropic had no entitlement to use pirated copies for its central library," Alsup wrote. "Creating a permanent, general-purpose library was not itself a fair use excusing Anthropic's piracy."
That is, he ruled that
- buying, physically cutting up, physically digitizing books, and using them for training is fair use
- pirating the books for their digital library is not fair use.
So Suno would only really need to buy the physical albums and rip them to be able to generate music at an industrial scale?
If the output from said model uses the voice of another person, for example, we already have a legal framework in place for determining if it is infringing on their rights, independent of AI.
Courts have heard cases of individual artists copying melodies, because melodies themselves are copyrightable: https://www.hypebot.com/hypebot/2020/02/every-possible-melod...
Copyright law is a lot more nuanced than anyone seems to have the attention span for.
But Suno is definitely not training models in their basement for fun.
They are a private company selling music, using music made by humans to train their models, to replace human musicians and artists.
We'll see what the courts say but that doesn't sound like fair use.
The law doesn't distinguish between basement and cloud – it's a service. You can sell access to the service without selling songs to consumers.
Suno can’t prevent humans from copying other humans, it can only make sure that the direct output of its system isn’t infringing.
In this context, this would be the equivalent of Suno explicitly placing stop points throughout the training, tokenization, and generation processes to verify that there was absolutely no chance of it generating copyrighted material through some kind of clean room reconstruction test. They would also need those tests to be audited at random by a third party governing body. Obviously they are not doing this, so the metaphor definitely does not track here.
Anything remotely beyond that and we have teams of humans adjudicating specific cases: https://library.mi.edu/musiccopyright/currentcases
Surely, for Suno to claim fair use and be given free rein to build a commercial business off literally anyone's original works, the bare minimum bar for allowing that usage would be: make a satisfactory test to prove that you're always doing something transformative and original, within practical limits.
But I guess I'm not surprised that 2025 has little respect for artists.
When a general computer using agent recreates songs in Logic Pro in high fidelity, then what?
It’s called Fair Use for a reason – we let humans Use things generally and ask them to be Fair.
Or we can go in the direction of movies and TV where screenshots of protected content show up blank on my iPhone. Just in case someone wanted to, god forbid, clip the show.
>It’s called Fair Use for a reason – we let humans Use things generally and ask them to be Fair.
So exhausted with people who come to these threads and try to discuss legal issues by only paying lip service to the words and not their meanings, let alone the actual law that they seem to want to debate. Then they go even further and turn it into some grand political statement, or hypothesize why copyright shouldn't exist at all. But there is absolutely no jurisprudence that would indicate a DAW is the kind of tool I described. I understand you came up with an argument in your head why it could be, but I'm letting you know that in the law, it's not what would be considered a reasonable argument and it would go nowhere.
DAWs are tools made to create music, generally. They do not contain banks of copyrighted materials to which the user ultimately pulls the copying "trigger" (that's the system I described).
I hope that helps.
Someone responded and said "Why not DAWs, then?" The answer is because a DAW is not that kind of service or machine.
>It’s easy to fall back to known concepts to frame new things, but that is not accurate. LLMs do not hold a “banks of copyrighted materials”,
As an aside. That's clearly not true in some models given that in a number of the cases, the plaintiffs can recreate their works verbatim.
>DAWs are tools made to create music, generally. They do not contain banks of copyrighted materials to which the user ultimately pulls the copying "trigger" (that's the system I described).
You are quite literally describing sample packs (which are copyrighted). The only difference is that they figured out a fair licensing scheme for those. Is my understanding of copyright law wrong or poor here? Imagine we invented some new hypothetical technology to take all of the sample packs in the world as input and produce new sample packs that humans haven't thought of before. Should we figure out how to license those packs fairly, or pretend we never invented it?
Only so many artists have the patience to make each drum from scratch.
Where are you on the continuum? Regarding training an AI model in my basement on purchased music, do you think I should:
- Not be allowed to train it
- Not be allowed to run it
- Not be allowed to share outputs from it anywhere
- Not be allowed to share outputs from it publicly
- Not be allowed to share outputs from it commercially
- Not be allowed to share its weights for others to run it
Or are you primarily focused on the current legal precedent?
Sure, I appreciate that. My point is that none of this has anything to do with § 1201, so there's really no point in coming to this with a kind of counterproductive incredulity stemming from your own beliefs about that one particular law. Not saying that is necessarily what you are doing, but I see that kind of approach so frequently here. A lot of not really knowing what a copyright protects, its limits, how they are adjudicated, etc., but then a lot of confidence about how it is all just wrong for society.
For starters, to answer your first question. Copyright protects creative artistic expressions. What is covered is defined in the copyright statute, and the list does not include massages. So, that would be the reason why a massage is not protected. Why isn't "massage" on that list? Probably because no one can reasonably consider a massage a creative artistic expression. Choreography is the art form in which that kind of expression exists and would be covered. Could you copyright a dance that included massage movements? Yeah, sure. Could you copyright a dance that consisted entirely of massage movements? Sure. Could you use that copyright to prevent massage therapists from "performing" massages? No.
That's obviously a very surface-level take, and what is actually protected in a copyright isn't necessarily the entirety of the work but the aspects of it that are original expressions. There are other limitations too, like something being de minimis. You can't copyright "the sky was blue" ("Scarlet Begonias", the Grateful Dead) and actually prohibit others from using the phrase. That phrase alone is too small (among other things). The Grateful Dead do have a copyright to the entirety of the lyrics to "Scarlet Begonias" and can control various kinds of uses of the lyrics.
>Or are you primarily focused on the current legal precedent?
All litigators are focused on current legal precedent. You cannot make arguments for how things should be without regard for how things are as that is the fundamental basis for what should be changed and why.
>Where are you on the continuum? Regarding training an AI model in my basement on purchased music, do you think I should:
Personally, I find AI abhorrent. I think it's wrong for it to be trained without any compensation to the authors of the works used in the training, and I think it's wrong for the output to be commercialized to the benefit of the owner of the model without any compensation to the authors of the works used in generating the outputs.
The problem is, copyright law wasn't written for machines. It was written for humans who create things.
In the case of songs (or books, paintings, etc), only humans and companies can legally own copyright, a machine can't. If an AI-powered tool generates a song, there’s no author in the legal sense, unless the person using the tool claims authorship by saying they operated the tool.
So we're stuck in a grey zone: the input is human, the output is AI generated, and the law doesn't know what to do with that.
For me the real debate is: Do we need new rules for non-human creation?
when you buy a book, you are not acceding to a license to only ever read it with human eyes, forbearing to memorize it, never to quote it, never to be inspired by it.
> Interestingly, Llama 1 65B, a similar-sized model released in February 2023, had memorized only 4.4 percent of Harry Potter and the Sorcerer's Stone. This suggests that despite the potential legal liability, Meta did not do much to prevent memorization as it trained Llama 3. At least for this book, the problem got much worse between Llama 1 and Llama 3.
> Harry Potter and the Sorcerer's Stone was one of dozens of books tested by the researchers. They found that Llama 3.1 70B was far more likely to reproduce popular books—such as The Hobbit and George Orwell’s 1984—than obscure ones. And for most books, Llama 3.1 70B memorized more than any of the other models.
Memorising isn't wrong but when machines memorise at scale and the people behind the original work get nothing, it raises big ethical questions.
The law hasn't caught up.
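For context, the memorization figures quoted above are produced by checking how much of a source text a model can reproduce verbatim. A toy sketch of such an overlap check in plain Python (no real model involved; the function name and window size are illustrative, and the actual studies work with token-level probabilities rather than literal substring matching):

```python
def memorized_fraction(source: str, generated: str, span: int = 10) -> float:
    """Fraction of `span`-word windows from `source` that appear verbatim in `generated`."""
    src_words = source.split()
    gen_text = " ".join(generated.split())  # normalize whitespace
    # Every contiguous window of `span` words from the source text.
    windows = [
        " ".join(src_words[i:i + span])
        for i in range(len(src_words) - span + 1)
    ]
    if not windows:
        return 0.0
    hits = sum(1 for w in windows if w in gen_text)
    return hits / len(windows)
```

Run over a whole book against a model's sampled continuations, a metric like this is what lets researchers say "42 percent" for one model and "4.4 percent" for another.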
I also play the guitar, and it took me 10 years to learn 30 or 40 songs. So I don't see how anyone can learn 7 million songs in a couple of minutes.
Most AI seems much better at reproducing semi-identical copies of an original work than existing video/audio encoders.
Now, what if instead of training myself using real instruments, I train my AI and do the same. Is it different?
It is complicated, but there are many arguments in favor of fair use, probably more than there are against, but as you say, let the courts decide.
But in any case, piracy is illegal in every case. As a human, it is illegal for me to use pirated copies, whether it is for training myself as a musician, for training my AI, or for simply listening.
Claude has been considered transformative given it's not really meant to generate books but Suno or Midjourney are absolutely in another category.
But as long as the model isn't outputting infringing works, there's not really any issue there either.
Anthropic doesn't take books and use them to train a model that is intended to generate new books. (Perhaps it could do that, to some extent, but that's not its [sole] purpose.)
But Suno would be taking music to train a model in order to generate new music. Is that transformative enough? We don't know what a judge thinks, at least not yet.
If you read the ruling, training was considered fair use in part because Claude is not a book generation tool. Hence it was deemed transformative. Definitely not what Suno and Udio are doing.
Do keep in mind though: this is only for the wealthy. They're still going to send the Pinkertons to your house if you dare copy a Blu-ray.
Hey woah now, that's a Hasbro play, not a Disney one.
Found it: https://www.nbcnews.com/tech/tech-news/federal-judge-rules-c...
> “That Anthropic later bought a copy of a book it earlier stole off the internet will not absolve it of liability for the theft,” [Judge] Alsup wrote, “but it may affect the extent of statutory damages.”
In the UK it's a criminal offense if you distribute a copyrighted work with the intent to make gain or with the expectation that the owner will make a loss.
Gain and loss are only financial in this context.
Meaning that in both countries the copyright owner can sue you for copyright infringement.
Do you mean:
A) It's not a criminal offence?
B) The copyright owner cannot file a civil suit for damages?
C) Something else?
We will have a trial on the pirated copies used to create Anthropic’s central library and the resulting damages, actual or statutory (including for willfulness).
I think the distinction between civil and criminal trials is smaller in my home country. The fact that there is a trial at all implies that someone committed a ‘crime’.
Rhetorical question, we all know that me reading books is not "transformative" so it won't be considered fair use for me to yoink them (transformative as in transforming more damage to the society at large into more money for the already rich).
They simply don't want to and think they can skirt the law while the judges catch up.
If the book is out of print, then tough luck. That's not a license to infringe on the publisher's copyright. If we're not ok with that, we have legislative means to change that. A judge shouldn't be rewriting law in that manner.
But if you tried to open a black market selling that media: you'd be hunted down to the ends of the earth. Or to China/North Korea, at least.
Why would you ever do that? Nobody would buy it. They'd just get it in the same place you did.
Everyone can put a disc in a DVD player; sailing the high seas is much trickier.
Having access to a camera doesn't permit you to take the footage home to review. The company still owns that footage, after all.
Now, if you had your own camera recording everything at your desk... I guess that falls under one- or two-party consent rules.
This conversation becomes incredibly unenjoyable when you pull rhetorical techniques like completely ignoring the entirety of what I wrote.
No it's not. And you ever heard of a publishing house? They don't need to negotiate with every single author individually. That's preposterous.
Yeah they do. What do you think the employees of a publishing house do? They make deals, work with authors, and accept/reject pitches. They 100% need to make sure every work is under a negotiated contract.
It's not the only reason fair use exists, but it's the thing that allows e.g. search engines to exist, and that seems pretty important.
> And you ever heard of a publishing house? They don't need to negotiate with every single author individually. That's preposterous.
There are thousands of publishing houses and millions of self-published authors on top of that. Many books are also out of print or have unclear rights ownership.
No, it kinda isn't. Show me anything that supports this idea beyond your own immediate conjecture right now.
>It's not the only reason fair use exists, but it's the thing that allows e.g. search engines to exist, and that seems pretty important.
No, that's the transformative element of what a search engine provides. Search engines are not legal because they can't contact each licensor, they are legal because they are considered hugely transformative features.
>There are thousands of publishing houses and millions of self-published authors on top of that. Many books are also out of print or have unclear rights ownership.
Okay, and? How many customers does Microsoft bill on a monthly basis?
It's inherent in the nature of the test. The most important fair use factor is the effect on the market for the work, so if the use would be uneconomical without fair use then the effect on the market is negligible because the alternative would be that the use doesn't happen rather than that the author gets paid for it.
> No, that's the transformative element of what a search engine provides. Search engines are not legal because they can't contact each licensor, they are legal because they are considered hugely transformative features.
To make a search engine you have to do two things. One is to download a copy of the whole internet, the other is to create a search index. I'm talking about the first one, you're talking about the second one.
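Those two steps can be sketched with a toy inverted index: the crawl stores verbatim copies of pages, while indexing builds the derived lookup structure that users actually query (all names here are illustrative):

```python
from collections import defaultdict

def build_index(pages):
    """Step 2: turn raw page copies (url -> text) into a word -> set-of-URLs index."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return dict(index)

# Step 1 (the copying at issue): raw text downloaded from each page.
pages = {
    "a.example": "fair use doctrine",
    "b.example": "use of copyrighted works",
}
index = build_index(pages)
```

The index maps "use" to both URLs without storing either page's text; the verbatim copies exist only as an intermediate input to the transformative output.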
> Okay, and? How many customers does Microsoft bill on a monthly basis?
Microsoft does this with an automated system. There is no single automated system where you can get every book ever written, and separately interfacing with all of the many systems needed in order to do it is the source of the overhead.
If your business model is not economically sustainable in the current legal landscape you operate in, the correct outcome is you go out of business.
There are lots and lots of potential businesses, infinite in fact, that fall under this understanding. They don't exist because they can't, because we don't want them to, so you never see them. Which might give the impression of a right to scale, but no, it does not exist.
No, that's not the most important factor. The transformative factor is the most important. Effect on market for the work doesn't even support your argument anyway. Your argument is about the cost of making the end product, which is totally distinct from the market effects on the copyright holder when the infringer makes and releases the infringing product.
>To make a search engine you have to do two things. One is to download a copy of the whole internet, the other is to create a search index. I'm talking about the first one, you're talking about the second one.
So? That doesn't make you right. Go read the opinions, dude. This isn't something that's actually up for debate. Search engines are fair uses because of their transformative effect, not because they are really expensive otherwise. Your argument doesn't even make sense. By that logic, anything that's expensive becomes a fair use. It's facially ridiculous. Them being expensive is neither sufficient nor necessary for them to be a fair use. Their transformative nature is both sufficient and necessary to be found a fair use. Full stop.
>Microsoft does this with an automated system. There is no single automated system where you can get every book ever written, and separately interfacing with all of the many systems needed in order to do it is the source of the overhead.
Okay, and? They don't need to get every single book ever written. The libraries they pirated do not consist of "every single book ever written". It's hard to take this argument in good faith because you're being so ridiculous.
It's a four factor test because all of the factors are relevant, but if the use has negligible effect on the market for the work then it's pretty hard to get anywhere with the others. For example, for cases like classroom use, even making verbatim copies of the entire work is often still fair use. Buying a separate copy for each student to use for only a few minutes would make that use uneconomical.
> Effect on market for the work doesn't even support your argument anyway. Your argument is about the cost of making the end product, which is totally distinct from the market effects on the copyright holder when the infringer makes and releases the infringing product.
We're talking about the temporary copies they make during training. Those aren't being distributed to anyone else.
> So? That doesn't make you right.
Making a copy of everything on the internet is a prerequisite to making a search engine. It's something you have to do as a step to making the index, which is the transformative step. Are you suggesting that doing the first step is illegal or what do you propose justifies it?
> By that logic, anything that's expensive becomes a fair use. It's facially ridiculous.
Anything with unreasonably high transaction costs. Why is that ridiculous? It doesn't exempt any of the normal stuff like an individual person buying an individual book.
> They don't need to get every single book ever written.
They need to get as many books as possible, with the platonic ideal being every book. Whether or not the ideal is feasible in practice, the question is whether it's socially beneficial to impose a situation with excessively high transaction costs in order to require something with only trivial benefit to authors (potentially selling one extra copy).
All four factors are not equally relevant which is something described in pretty much every single fair use opinion. Educational uses are educational uses and considered fair because of their educational purpose (purpose is one of the factors), again, not because it's expensive. Maybe next time try googling or using ChatGPT "fair use educational".
>We're talking about the temporary copies they make during training. Those aren't being distributed to anyone else.
It's your argument. Not mine. You do not understand the market harm factor and it has nothing to do with Anthropic's transaction costs. That's just fully outright absolutely incorrect application of law.
>Making a copy of everything on the internet is a prerequisite to making a search engine. It's something you have to do as a step to making the index, which is the transformative step. Are you suggesting that doing the first step is illegal or what do you propose justifies it?
The transformative step is why it's a fair use, not the "market harm" (which you misunderstand) or the made up argument that it's "too expensive". In fact, I said this like every single turn in our conversation so it's a bit perplexing to me that you can now ask me "do you mean that it being transformative is what makes it legal" when that was my exact argument three times.
>Anything with unreasonably high transaction costs. Why is that ridiculous? It doesn't exempt any of the normal stuff like an individual person buying an individual book.
It's ridiculous because of the example I gave. Things being expensive is not a defense to copyright infringement and copyright law has no obligation to make expensive business models work. Copyright has an obligation to make transformative business models work because of the overall good they provide to society. Describing it as a "transaction cost" just kicks the can down the road even further and doesn't deal with the substance, either. They could have gone to the major publishers and licensed books from them. They didn't. That's generally who they are being sued by. When they are being sued by copyright owners in the fringe examples you pointed to, they will become relevant then.
>They need to get as many books as possible, with the platonic ideal being every book. Whether or not the ideal is feasible in practice, the question is whether it's socially beneficial to impose a situation with excessively high transaction costs in order to require something with only trivial benefit to authors (potentially selling one extra copy).
Lol dude, it was your example, not mine. They do not need every single book. They aren't being sued over every single book anyway, so it's totally beside the point.
Substitute infringement for theft.
No, that doesn't undo the infringement. At most, that would mitigate actual damages, but actual damages aren't likely to be important, given that statutory damages are an alternative and are likely to dwarf actual damages. (It may also figure into how the court assigns statutory damages within the very large range available for those, but that range does not go down to $0.)
> They will have ceased and desisted.
"Cease and desist" is just to stop incurring additional liability. (A potential plaintiff may accept that as sufficient to not sue if a request is made and the potential defendant complies, because litigation is uncertain and expensive. But "cease and desist" doesn't undo wrongs and neutralize liability when they've already been sued over.)
For anyone else who wants to do the same thing though this is likely all they need to do.
Cutting up and scanning books is hard work, and doing the same thing digitally to ebooks isn't labor-free either, especially when they have to be downloaded from random sites and cleaned up from different formats. Torrenting a bunch of epubs and paying for individual books is probably cheaper.
Setting the penalty to what it would have cost to obey the law in the first place does the opposite.
If you give people a claim for damages which is an order of magnitude larger than their actual damages, it encourages litigiousness and becomes a vector for shakedowns because the excessive cost of losing pressures innocent defendants to settle even if there was a 90% chance they would have won.
Meanwhile both parties have the incentive to settle in civil cases when it's obvious who is going to win, because a settlement to pay the damages is cheaper than the cost of going to court and then having to pay the same damages anyway. Which also provides a deterrent to doing it to begin with, because even having to pay lawyers to negotiate a settlement is a cost you don't want to pay when it's clear that what you're doing is going to have that result.
And when the result isn't clear, penalizing the defendant in a case of first impression isn't just either, because it wasn't clear and punitive measures should be reserved for instances of unambiguous wrongdoing.
> - buying, physically cutting up, physically digitizing books, and using them for training is fair use
> - pirating the books for their digital library is not fair use.
That seems inconsistent with one another. If it's fair use, how is it piracy?
It also seems pragmatically trash. It doesn't do the authors any good for the AI company to buy one copy of their book (and a used one at that), but it does make it much harder for smaller companies to compete with megacorps for AI stuff, so it's basically the stupidest of the plausible outcomes.
* They downloaded a massive online library of pirated books that someone else was distributing illegally. This was not fair use.
* They then digitised a bunch of books that they physically owned copies of. This was fair use.
This part of the ruling is pretty much existing law. If you have a physical book (or own a digital copy of a book), you can largely do what you like with it within the confines of your own home, including digitising it. But you are not allowed to distribute those digital copies to others, nor are you allowed to download other people's digital copies that you don't own the rights to.
The interesting part of this ruling is that once Anthropic had a legal digital copy of the books, they could use it for training their AI models and then release the AI models. According to the judge, this counts as fair use (assuming the digital copies were legally sourced).
Can you point me to the US Supreme Court case where this is existing law?
It's pretty clear that if you have a physical copy of a book, you can lend it to someone. It also seems pretty reasonable that the person borrowing it could make fair use of it, e.g. if you borrow a book from the library to write a book review and then quote an excerpt from it. So the only thing that's left is, what if you do the same thing over the internet?
Shouldn't we be able to distinguish this from the case where someone is distributing multiple copies of a work without authorization and the recipients are each making and keeping permanent copies of it?
Also, I don't quite understand how your example is relevant to the case. If you give a book to a friend, they are now the owner of that book and can do what they like with it. If you photocopy that book and give them the photocopy, they are not the owner of the book and you have reproduced it without permission. The same is, I believe, true of digital copies - this is how ebook libraries work.
In this case, Anthropic were the legal owners of the physical books, and so could do what they wanted with them. They were not the legal owners of the digital books, which means they can get prosecuted for copyright infringement.
We're talking about lending rather than ownership transfers, though of course you could regard lending as a sort of ownership transfer with an agreement to transfer it back later.
> If you photocopy that book and give them the photocopy, they are not the owner of the book and you have reproduced it without permission.
But then the question is whether the copy is fair use, not who the owner of the original copy was, right? For example, you can make a fair use photocopy of a page from a library book.
> They were not the legal owners of the digital books, which means they can get prosecuted for copyright infringement.
Even if the copy they make falls under fair use and the person who does own that copy of the book has no objection to their doing this?
If you photocopy a single page from a library book, this is often (but not always) fair use because you're copying only a limited part of the book. In the same way, you can quote a section or paragraph of a book under fair use. You cannot copy the whole book, though. Therefore:
> Even if the copy they make falls under fair use and the person who does own that copy of the book has no objection to their doing this?
If the copy had been made under fair use, then yes, this wouldn't be illegal. But it wasn't, because it was a reproduction and distribution of the entire book by someone who did not have the right to do that.
I understand that some of these things might be confusing to you, but Anthropic is absolutely in the position of being able to afford attorneys and get good advice as to what they could legally do. I hope you also understand that good legal advice isn't being told what you want to hear so you can do the thing you want to do, without any regard for the likely outcomes.
With that in mind, what do you think the inconsistency is between ReDigi and Sony?
He said:
> It was always somewhat obvious that pirating a library would be copyright infringement.
??
> pirating the books for their digital library is not fair use.
"Pirating" is a fuzzy word and has no real meaning. Specifically, I think this is the cruz:
> without adding new copies, creating new works, or redistributing existing copies
Essentially: downloading is fine, sharing/uploading is not. Which makes sense. The assertion here is that Anthropic (from this line) did not distribute the files they downloaded.
It's a bit surprising that you can suddenly download copyrighted materials for personal use and it's kosher as long as you don't share them with others.
I never saw any of these. All the cases I saw were related to people using torrents or other P2P software (which aren't just downloading). These might exist, but I haven't seen them.
> It's a bit surprising that you can suddenly download copyrighted materials for personal use and it's kosher as long as you don't share them with others.
Every click on a link is a risk of downloading copyrighted material you don't have the rights to.
Searching the internet, it appears that it's a civil infraction, but it's also confused with the notion that "piracy" is illegal, a term that's used for many different purposes. I see "It is illegal to download any music or movies that are copyrighted." under legal advice, which I know as a statement is not true.
Hence my confusion.
I should note: I'm not arguing from the perspective of whether it's morally or ethically right. Only that even in the context of this thread, things are phrased that aren't clear.
Gonzalez is a ruling about downloading even though there was also distribution.
After all, illegally downloading research papers in order to write new ones is highly transformative.
I've fixed your question so that it accurately represents what I said and doesn't put words in my mouth.
If I click on a link and download a document, is that illegal?
I do not know if the person has the right to distribute it or not. IANAL, but when people were getting sued by the RIAA years back, it was never about downloading, but also distribution.
As I said, IANAL, but feel free to correct me, but my understanding is that downloading a document from the internet is not illegal.
Did you mean to write "but about distribution" here?
That suggests otherwise.
https://www.courtlistener.com/docket/67569326/598/kadrey-v-m...
Note: I am not a lawyer.
https://investors.autodesk.com/news-releases/news-release-de...
What you’re saying is like calling Al Capone a tax cheat. Nonsense.
They went after Aaron over copyright.
Here's an article explaining in more detail [1].
Most experts say that if Swartz had gone to trial and the prosecution had proved everything they alleged and the judge had decided to make an example of Swartz and sentence harshly it would have been around 7 years.
Swartz's own attorney said that if they had gone to trial and lost, he thought it was unlikely that Swartz would get any jail time.
Swartz also had at least two plea bargain offers available. One was for a guilty plea and 4 months. The other was for a guilty plea and the prosecutors would ask for 6 months but Swartz could ask the judge for less or for probation instead and the judge would pick.
[1] https://www.popehat.com/2013/02/05/crime-whale-sushi-sentenc...
When I clicked the link, I got an article about a business that was selling millions of dollars of pirated software.
This guy made millions of dollars in profit by selling pirated software. This wasn't a case of transformative works, nor of an individual doing something for themselves. He was plainly stealing and reselling something.
Seven years for thumbing your nose at Autodesk when armed robbery would get you less time says some interesting things about the state of legal practice.
This absolutely falls under copyright law as I understand it (not a lawyer). E.g. the disclaimer that rolls before every NFL broadcast. The notice states that the broadcast is copyrighted and any unauthorized use, including pictures, descriptions, or accounts of the game, is prohibited. There is wiggle room for fair use by news organizations, critics, artists, etc.
Make no mistake, they’re seeking to exploit the contents of that material for profits that are orders of magnitude larger than what any shady pirated-material reseller would make. The world looks the other way because these companies are “visionary” and “transformational.”
Maybe they are, and maybe they should even have a right to these buried works, but what gives them the right to rip up the rule book and (in all likelihood) suffer no repercussions in an act tantamount to grand theft?
There’s certainly an argument to be had about whether this form of research and training is a moral good and beneficial to society. My first impression is that the companies are too opaque in how they use and retain these files, albeit for some legitimate reasons, but nevertheless the archival achievements are hidden from the public, so all that’s left is profit for the company on the backs of all these other authors.
This is very different to what Anthropic did. Nobody was buying copies of books from Anthropic instead of the copyright holder.
We've only dealt with the fairly straight-forward legal questions so far. This legal battle is still far from being settled.
There's already one decision on a competitor.
It makes sense, if you think of how the model works.
Not even the authors suing Anthropic have claimed it can do this, have they?
As an extreme example, consider murder. Obviously it should be illegal, but if it's legal for one group and not for another, the group for which it's illegal will probably be wiped out, having lost the ability to avenge deaths in the group.
It's much more important that laws are applied impartially and equally than that they are even a tiny bit reasonable.
Laws and their enforcement are a clusterfuck. To achieve greater justice we should strive towards better judgements overall.
God, stop with the group on group bs please and engage with things the way they're written without injecting the entirety of your cynical worldview layered on top.
Furthermore, group affiliation based differences in judicial decisions are very common, both when it comes to ethnic origin, wealth and profession.
In this case group affiliation is also directly relevant: individuals who have infringed copyright are typically not treated in the way that these firms that have infringed copyright are. The group affiliation in question is thus 'are you an employer/wealthy person owning part of a large firm' vs 'a normal, non-employer/non-wealthy person'.
If there are precedents with a certain application, then they must continue or be overturned generally.
Correct but uneven application of a law is more dangerous than incorrect but even application.
in other words, provided you have enough spare capital to spin up a corporation, you can break the law!!!!
This is reaching at best.
"Pirates" also transform the works they distribute. They crack it, translate it, compress it to decrease download times, remove unnecessary things, make it easier to download by splitting it in chunks (essential with dial-up, less so nowadays), change distribution formats, offer it through different channels, bundle extra software and media that they themselves might have coded like trainers, installers, sick chiptunes and so on. Why is the "transformation" done by a big corpo more legal in your views?
your piracy apologetics are obviously irrelevant.
Come up with a better comparison.
> First, Authors argue that using works to train Claude’s underlying LLMs was like using works to train any person to read and write, so Authors should be able to exclude Anthropic from this use (Opp. 16).
> Second, to that last point, Authors further argue that the training was intended to memorize their works’ creative elements — not just their works’ non-protectable ones (Opp. 17).
> Third, Authors next argue that computers nonetheless should not be allowed to do what people do.
https://media.npr.org/assets/artslife/arts/2025/order.pdf

> First, Authors argue that using works to train Claude’s underlying LLMs was like using works to train any person to read and write, so Authors should be able to exclude Anthropic from this use (Opp. 16). But Authors cannot rightly exclude anyone from using their works for training or learning as such. Everyone reads texts, too, then writes new texts. They may need to pay for getting their hands on a text in the first instance. But to make anyone pay specifically for the use of a book each time they read it, each time they recall it from memory, each time they later draw upon it when writing new things in new ways would be unthinkable. For centuries, we have read and re-read books. We have admired, memorized, and internalized their sweeping themes, their substantive points, and their stylistic solutions to recurring writing problems.
Couldn't have put it better myself (though $deity knows I tried many times on HN). Glad to see Judge Alsup continues to be the voice of common sense in legal matters around technology.
Yep, that name's a blast from the past! He was the judge on the big Google/Oracle case about Android and Java years ago, IIRC. I think he even learned to write some Java so he could better understand the case.
“We have admired, memorized, and internalized their sweeping themes, their substantive points, and their stylistic solutions to recurring writing problems.”
Claude is not doing any of these things. There is no admiration, no internalizing of sweeping themes. There’s a network encoding data.
We’re talking about a machine that accepts content and then produces more content. It’s not a person, it’s owned by a corporation that earns money on literally every word this machine produces. If it didn’t have this large corpus of input data (copyrighted works) it could not produce the output data for which people are willing to pay money. This all happens at a scale no individual could achieve because, as we know, it is a machine.
That the mechanism performing these things is a network encoding data is… well, that description, at that level of abstraction, is a similarity with the way a human does it, not even a difference.
My network is a 3D mess made of pointy bi-lipid bags exchanging protons across gaps moderated by the presence of neurochemicals, rather than flat sheets of silicon exchanging electrons across tuned energy band-gaps moderated by other electrons, but it's still a network.
> We’re talking about a machine that accepts content and then produces more content. It’s not a person, it’s owned by a corporation that earns money on literally every word this machine produces. If it didn’t have this large corpus of input data (copyrighted works) it could not produce the output data for which people are willing to pay money. This all happens at a scale no individual could achieve because, as we know, it is a machine.
My brain is a machine that accepts content in the form of job offers and JIRA tickets (amongst other things), and then produces more content in the form of pull requests (amongst other things). For the sake specifically of this question, do the other things make a difference? While I count as a person and am not owned by any corporation, when I work for one, they do earn money on the words this biological machine produces. (And given all the models which are free to use, the LLMs definitely don't earn money on "literally" every word those models produce). If I didn't have the large corpus of input data — and there absolutely was copyright on a lot of the school textbooks and the TV broadcast educational content of the 80s and 90s when I was at school, and the Java programming language that formed the backbone of my university degree — I could not produce the output data for which people are willing to pay money.
Should corporations who hire me be required to pay Oracle every time I remember and use a solution that I learned from a Java course, even when I'm not writing Java?
That the LLMs do this at a scale no individual could achieve because it is a machine, means it's got the potential to wipe me out economically. Economics threat of automation has been a real issue at least since the luddites if not earlier, and I don't know how the dice will fall this time around, so even though I have one layer of backup plan, I am well aware it may not work, and if it doesn't then government action will have to happen because a lot of other people will be in trouble before trouble gets to me (and recent history shows that this doesn't mean "there won't be trouble").
Copyright law is one example of government action. So is mandatory education. So is UBI, but so too is feudalism.
Good luck to us all.
It's not an issue because it's not currently illegal because nobody could have foreseen this years ago.
But it is profiting off of the unpaid work of millions. And there's very little chance of change because it's so hard to pass new protection laws when you're not Disney.
The model is very different.
This argument is more along the lines of: blaming Microsoft Word because someone typed characters into the word processor and output a copy of an existing book. (Yes, it is a lot easier, but the rationale is the same.) In my mind the end user prompting the model would be the one potentially infringing.
I do think that a big part of the reason Anthropic downloaded millions of books from pirate torrents was because they needed that input data in order to generate the output, their product.
I don’t know what that is, but, IMHO, not sharing those dollars with the creators of the content is clearly wrong.
I don't think humans learn via backprop or in rounds/batches, our learning is more "online".
If I input text into an LLM it doesn't learn from that unless the creators consciously include that data in the next round of teaching their model.
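That distinction can be sketched in a few lines. This is a hypothetical toy, not how any real LLM is implemented: `online_learner` stands in for a system that updates its internal state on every interaction, while `frozen_model` stands in for a deployed model whose weights only change during a separate training round.

```python
def online_learner(weight, example, lr=0.1):
    # "Online" style: every new example immediately nudges the internal
    # state, roughly analogous to a human adjusting as they go.
    prediction = weight * example
    error = example - prediction
    return weight + lr * error * example  # returns updated weight

def frozen_model(weight, example):
    # Deployed-LLM style: the input shapes the output of this one call,
    # but the weight itself is left untouched afterwards.
    return weight * example

w = 0.5
w_updated = online_learner(w, 2.0)  # weight changes: 0.5 -> 0.7
output = frozen_model(w, 2.0)       # produces output, w is still 0.5
print(w_updated, output)
```

The point of the sketch is only the asymmetry: in the second case, nothing you feed the model at inference time alters it unless the operators fold that data into the next training run.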
Humans also don't require samples of every text in history to learn to read and write well.
Hunter S Thompson didn't need to ingest the Harry Potter books to write.
“Piracy” is mostly a rhetorical term in the context of copyright. Legally, it’s still called infringement or unauthorized copying. But industries and lobbying groups (e.g., RIAA, MPAA) have favored “piracy” for its emotional weight.
Copyright infringement is unauthorized reproduction - you have made a copy of something, but you have not deprived the original owner of it. At most, you denied them revenue although generally less than the offended party claims, since not all instances of copying would have otherwise resulted in a sale.
And to be clear, we use the word infringement precisely because it is not theft.
In addition to depriving revenue, piracy can also increase the general relevance the author has, or may have, in the public sphere. Essentially, one of the side effects of piracy is basically advertising.
Doctorow was one of the early ones to bring this aspect of it up.
Real piracy always involves booty.
Naturally booty is wealth that has been hoarded.
It has nothing to do with wealth that may or may not come in the future, regardless of whether any losses due to piracy have already taken place.
Referring to this? (Wikipedia's disambiguation page doesn't seem to have a more likely article.)
https://en.wikipedia.org/wiki/Richard_Stallman#Copyright_red...
Stallman places great importance on the words and labels people use to talk about the world, including the relationship between software and freedom. He asks people to say free software and GNU/Linux, and to avoid the terms intellectual property and piracy (in relation to copying not approved by the publisher). One of his criteria for giving an interview to a journalist is that the journalist agrees to use his terminology throughout the article.
(As an aside, it seems pointless to decry it as a "talking point". The reason it was brought up is presumably because the author agrees with it and thinks it's relevant. It's also entirely possible that the author, like me, made this argument without being aware that it was popularized by Richard Stallman. If it makes sense then you can hear the argument without hearing the person and still find it agreeable.)
"Piracy" is used to refer to copyright violation to make it sound scary and dangerous to people who don't know better or otherwise don't think about it too hard. Just imagine if they called it "banditry" instead; now tell me that pirates are not bandits with boats. They may as well have called it banditry, and it's worth correcting that. (I also think it's worth ridiculing, but that doesn't appear to be Stallman's primary point.) It's not banditry, it's copyright infringement.
Edit:
Reading my comment again in the context of other things you wrote, I suspect the argument will not pass muster because you do not seem to see piracy's change in meaning as manufactured by PR work purchased by media industry leaders. I'm not really trying to convince you that it's true but it may be worth considering that it is the fundamental disagreement you seem to have with others on Stallman's point; again, not saying you're wrong, just that's where the disagreement is.
In short the post is bait.
This is an uncharitable interpretation. The ostensible point of the comment, or at least a stronger and still-reasonable interpretation, is that they are trying to point out that this specific word choice confuses concepts, which it does. Richard Stallman and the commenter in question are absolutely correct to point that out. You actually seem to be agreeing with Stallman, at least in the abstract.
It should be acknowledged how and why the meaning of the word changed. As I said, that seems to have been manufactured, which suggests, at least to me, that their (and Richard Stallman's) point is essentially the same as yours: the US media industry started paying PR firms to use "piracy" to mean something other than its normal definition, until that became the common definition.
They should not purposely use a different definition like that. That is Stallman's point, and why he refuses to say "piracy" instead of "copyright infringement"; ocean banditry is not copyright infringement and it is confusing -- intentionally so -- to say that it is.
https://torrentfreak.com/spotifys-beta-used-pirate-mp3-files...
Funky quote:
> Rumors that early versions of Spotify used ‘pirate’ MP3s have been floating around the Internet for years. People who had access to the service in the beginning later reported downloading tracks that contained ‘Scene’ labeling, tags, and formats, which are the tell-tale signs that content hadn’t been obtained officially.
https://www.forbes.com/2009/08/04/online-anime-video-technol...
https://venturebeat.com/business/crunchyroll-for-pirated-ani...
My theory is that once they saw how much traffic they were getting, they realized how big of a market (subbed/dubbed) anime was.
- riding a wave of change
- not caring too much about legal constraints (or as they would say now, "disrupting" the market, which very often means doing illegal shit that brings them far more money than any penalties they will ever face for it)
- nor caring too much about ethics
- and in recent years (starting with Amazon) a lot of technically illegal financing (technically, undercutting competitors' prices long term using money from elsewhere (e.g. investors) is an unfair competitive advantage, (theoretically) clearly not allowed by anti-monopoly laws). And before that you often still had other monopoly issues (e.g. see Wintel)
So yes, systematically not complying with the law to gain an unfair competitive advantage, knowing that many of the laws are, in the larger picture, toothless when applied to huge companies, is bread-and-butter work for US tech giants.
Breaking these laws is what lifted them from big to super market-dominant, to a point where they have monopoly-like power.
that is _never_ good for everyone, or even good for the majority long term
What is good for everyone (but a few rich people and sometimes the US government) is proper fair competition. It drives down prices and allows people to vote with their money; it is a cornerstone of the American dream, it pushes innovation, and it makes sure a country isn't left behind. Monopoly-like companies, on the other hand, tend to have exactly the opposite effect: higher prices (long term), corruption, stagnating innovation. A completely shattered American dream sounds pretty bad for the majority of Americans.
Are all music creators better off now than before Spotify?
And Spotify is a bad example, as it ran into another pseudo-monopoly with very unreasonable/unhealthy power (the few large music labels holding the rights to the majority of mainstream music).
They pretty much forced very bad terms onto Spotify, which is to some degree why Spotify is pushing podcasts, as they can't be long-term profitable with music (raising prices doesn't help if the issue is a percentage cut which rises too :/ )
Society underestimates the chasm that exists between an idea and raising sufficient capital to act on those ideas.
Plenty of people have ideas.
We only really see those that successfully cross it.
Small things: EULA breaches, consumer licenses being used commercially, for example.
If you're an individual pirating software or media, then from the rights owners' perspective, the most rational thing to do is to make an example of you. It doesn't happen everyday, but it does happen and it can destroy lives.
If you're a corporation doing the same, the calculation is different. If you're small but growing, future revenues are worth more than the money that can be extracted out of you right now, so you might get a legal nastygram with an offer of a reasonable payment to bring you into compliance. And if you're already big enough to be scary, litigation might be just too expensive to the other side even if you answer the letter with "lol, get lost".
Even in the worst case - if Anthropic loses and the company is fined or even shuttered (unlikely) - the people who participated in it are not going to be personally liable and they've in all likelihood already profited immensely.
And the system set up by society doesn't truly account for this or care.
but systematic, widespread big things, and often many of them, giving US giants an unfair competitive advantage
And don't think that if you are an EU company you can do the same in the US. Nope.
But naturally the US insists that US companies can do that in the EU, and complains every time a US company is fined for not complying with EU law.
The AI sector, famously known for its inability to raise funding. Anthropic has raised 17 billion dollars in the last four years.
Other industries do not have it this easy.
This is a narrative that gets passed around in certain circles to justify stealing content.
In this context, stealing is often used as a pejorative term to make piracy sound worse than it is. Except for mass distribution, piracy is often regarded as a civil wrong, and not a crime.
It would be more clear if you stick to either legal or colloquial variants, instead of switching back and forth. (Tbf, the judge in this case also used the term “piracy” colloquially).
In this case it seemed like you were making a point about the strict legal sense of a word, but misusing a different legal term to do so.
Hypocrisy requires deception, not ignorance.
Pirating 7 million books, remixing their content, and using that to make money on Claude.ai is like counterfeiting 7 million branded products and selling them on your Shopify website. The original creators don't get payment, and someone's profiting off their work.
Try doing that yourself and you'd get a knock on the door real quick.
Also mostly this would be a civil lawsuit for "damages".
The trial is scheduled for December 2025. That’s when a jury will decide how much Anthropic owes for copying and storing over seven million pirated books.
Now, places like flea markets have been known to have a counterfeit DVD or two.
And there is more than one way to compare to non-digital content.
Regular books and periodicals can be sold out and/or out-of-print, but digital versions do not have these same exact limitations.
A great deal of the time though, just the opposite occurs, and a surplus is printed that no one will ever read, and which will eventually be disposed of.
Newspapers are mainly in the extreme category where almost always a significant number of surplus copies are intentionally printed.
It's all part of the same publication, a huge portion of which no one has ever rightfully expected for every copy to earn anything at all, much less a return on every single copy making it back to the original creator.
Which is one reason why so much material is supported by ads. Even if you didn't pay a high enough price to cover the cost of printing, it was all paid for well before it got into your hands.
Digital copies which are going unread are something like that kind of surplus. If you save it from the bin you should be able to do whatever you want with it either way, scan it how you see fit.
You just can't say you wrote it. That's what copyright is supposed to be for.
Like at the flea market, when two different vendors are selling the same items but one has legitimately purchased them wholesale and the other vendor obtained theirs as the spoils of a stolen 18-wheeler.
How do you know which ones are the pirated items?
You can tell because the original owners of the pirated cargo suffered a definite loss, and have none of it remaining any more.
OTOH, with things like fake Nikes at the flea market, you can be confident they are counterfeit whether they were stolen from anybody in any way or not.
Don’t we already have laws covering this? For example, sometimes excess books can be thrown in the bin. Often, they have the covers removed. Some will say something to the effect that “if you’ve received this without a cover it is a copyright violation.” I think one of the points of the lawsuit is it gives copyright holders discretion as to how their works are used/sold etc. The idea that “if you saved it from the bin you can do with it whatever you want” strips them of that right.
You could split hairs over whether saving an item from the bin occurred after a procedure to remove covers and it was already dumped, or before any contemplation was made about if or when dumping would take place.
Saving either way would be preserving what would otherwise be lost, even if it was well premeditated in advance of any imminent risk.
What if it was the last remaining copy?
Or even the only copy ever in existence of an original manuscript?
It's just not a concept suitable for a black & white judgment.
That's a very good sign that probably an entire book of regulations needs to be thrown out instead, and a new law written to replace it with something more sensible.
> Or even the only copy ever in existence of an original manuscript?
I think these still remove the copyright of the author. As it stands, I have the right to write the best novel about the human condition ever conceived and also the right (if copyrighted) to not allow anyone to read it. I can light it on fire if I wish. I am not obligated to sell it to anyone. In the context of the above, I can stipulate that nobody can distribute excess copies even if they would be otherwise destroyed. You may think that’s wasteful or irrational but we have all kinds of rights that protect our ability to do irrational things with our own property.
>That's a very good sign that probably an entire book of regulations needs to be thrown out instead, and a new law written to replace it with something more sensible.
This sentiment implies that you do not think the owner has those rights. That’s fine, but there are plenty of people (myself included) who think those are reasonable rights. Intellectual property clause is in the first article of the US Constitution for a good reason, although I do think it can be abused.
Don’t have legal access to training data? Simply steal it, but move fast enough to keep ahead of the law. By the time lawsuits hit the company is worth billions and the product is embedded in everyday life.
That's a statement carefully crafted to be impossible to disprove. Of course they shipped pirated music (I've seen the files). Of course anyone paying attention knew. Nothing in the music industry was "clean" in those days. But, sure, no credible evidence because any evidence anyone shows you you'll decide is not credible. It's not in anyone's interests to say anything and none of it matters.
https://www.computerworld.com/article/1447323/google-reporte...
Facebook's "pivot to video" similarly relied on user-uploaded unlicensed video content, now not just pulling from television and film, but from content creators on platforms like YouTube.
Today, every "social" platform is now littered with "no copyright infringement intended" and "all credit to the original" copy-and-paste junk. Don't get me wrong, I'm a fan of remix culture – but I believe appropriating and monetizing the work of others without sharing the reward is a destructive cycle. And while there are avenues for addressing this, they're designed for the likes of Universal, Sony, Disney, etc. (I've had original recordings of original music flagged by megacorps because the applause triggered ContentID.)
AI slop further poisons the well. It's rough going out there.
It's not a common business practice. That's why it's considered newsworthy.
People on the internet have forgotten that the news doesn't report everyday, normal, common things, or it would be nothing but a listing of people mowing their lawns or applying for business loans. The reason something is in the news is because it is unusual or remarkable.
"I saw it online, so it must happen all the time" is a dopey lack of logic that infects society.
Edit: Apologies, I can’t edit it anymore.
And the playstation classic used an opensource ps1 emulator.
There was also some steam game ported from GameCube, and it had the Dolphin Emulator FPS counter in the corner of part of the trailer :D
I also remember reading that 2 of the PCSX2 devs ended up working on the EmotionEngine chip emulator for PS3 consoles with partial software emulation of the PS2 (the CECH-02 and later models, where they removed the EmotionEngine chip).
What you really should be asking is whether they infringed on the copyrights of the rippers. /s
Daniel Ek said: "my mission is to make music accessible and legal to everyone, while ensuring artists and rights holders got paid"
Also, the Swedish government has zero tolerance for piracy.
Stealing is stealing. Let's stop with the double standards.
They make money off the model weights, which is fair use (as confirmed by recent case law).
If it outputs parts of the book verbatim then that's a different story.
Pirating 7 million books, remixing their content, and using that to power Claude.ai is like counterfeiting 7 million branded products and selling them on your personal website. The original creators don't get credit or payment, and someone’s profiting off their work.
All this happens while authors, many of them teachers, are left scratching their heads with four kids to feed.
But it does...
Please keep in mind, copyright is intended as a compromise between benefit to society and to the individual.
A thought experiment: what about students pirating textbooks and applying that knowledge later on in their work?
Meanwhile other cases have been less friendly to it being fair use, AI companies are already paying vast sums to publishers who presumably they wouldn’t if they felt confident it was “the law”, and on and on.
I don’t like arguing from “it’s the law”. A lot of law is terrible. What’s right? It’s clear to me that if AI gets good enough, as it nearly is now, it sucks a lot of profit away from creators. That is unbalanced. The AI doesn’t exist without the creators, the creators need to exist for our society to be great (we want new creative works, more if anything). Law tends to start conservatively based on historical precedent, and when a new technology comes along it often errs on letting it do some damage to avoid setting a bad precedent. In time it catches up as society gets a better view of things.
The right thing is likely not to let our creative class be decimated so a few tech companies become fantastically wealthy - in the long run, it’s the right thing even for the techies.
In my opinion, it will be upheld.
Looking at what is stored and the manner in which it is stored, it makes sense that it's fair use.
If by "what is stored and the manner in which it is stored" you mean to signal model weights, I'm not sure what the argument is? The four factors of copyright in no way mention a storage medium for data, lossless or lossy.
> (1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes; (2) the nature of the copyrighted work; (3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and (4) the effect of the use upon the potential market for or value of the copyrighted work.
In my opinion, this will likely see a supreme court ruling by the end of the decade.
A trillion parameter SOTA model is not substantially comprised of the one copyrighted piece. (If it was a Harry Potter model trained only on Harry Potter books this would be a different story).
Embeddings are not copy paste.
The last point, about market impact, would be where they make their argument, but it's tenuous. Reproducing works is not the primary use of AI models, and built-in prompts try to avoid it, so it shouldn't be commonplace unless you're jailbreaking the model, and most folks aren't.
Easy for the pirate to say. Artists might argue their intent was to trade compensation for one's personal enjoyment of the work.
That article also focuses on larger media and "moderate" amounts of piracy, so there are absolutely caveats to your claim.
"As with other studies, Kim and his colleagues found that when enforcement is low and piracy is rampant, both manufacturers and retailers suffer."
“'The implication is simply that, situated in a real-world context, our manufacturer and retailer should recognize that a certain level of piracy or its threat might actually be beneficial'..."
that isn't "just" stealing, it's organized crime
I get the sentiment, but that statement, as is, is absurdly reductive. Details matter. Even if someone takes merchandise from a store without paying, their sentence will vary depending on the details.
Yes, but copying isn't stealing, because the person you "take" from still has their copy.
If you're allowed to call copying stealing, then I should be allowed to call hysterical copyright rabblerousing rape. Quit being a rapist, pyman.
There are so many texts, and they're so sparse, that if I could copyright a work and never publish it, the restriction would be irrelevant. The probability that you would accidentally come upon something close enough that copyright was relevant is almost infinitesimal.
Because of this, copyright is an incredibly weak restriction, and the fact that it is as weak as it is shows clearly that any use of a copyrighted work is due to the convenience of it being available.
That is, it's about making use of the work somebody else has done, not about that restricting you somehow.
Therefore copyright is much more legitimate than ordinary property. Ordinary property, especially ownership of land, can actually limit other people. But since copyright is so sparse, infringing on it is like going to a world with near-infinite space, picking the precise place where somebody has planted a field, and deciding to harvest from that particular field.
Consequently I think copyright infringement might actually be worse than stealing.
You're saying copying a book is worse than robbing a farmer of his food and/or livelihood, which cannot be replaced or duplicated. Meanwhile, someone who copies a book does not deprive the author of selling the book again (or of the tasty proceeds of a harvest).
I can't say I agree, for obvious reasons.
Just as the farmer obtains his livelihood from the investment-of-energy-to-raise-crops-to-energy cycle the author has his livelihood by the investment-of-energy-to-finding-a-useful-work-to-energy cycle.
So he is in fact robbed in a very similar way.
You'd have to steal the author's ownership of the intellectual property in order for the comparison to be valid, just as you stole ownership of his crop.
Separately, there is a reason why theft and copyright infringement are two distinct concepts in law.
Big if. Practically, the movie studios aren’t poor because their product has instances of infringement.
So the median person who is harmed when something competes with his book authorship is someone making 20k p.a., not someone who is a major shareholder of a big firm.
I remember when piracy wasn't theft, and information wanted to be free.
Ordinary property is much worse than copyright: copyright is time limited, whereas ordinary property is not necessarily obtained through work and is much more limited in availability than the number of possible sequences.
When someone owns land, that's a place you can actually stumble upon and be barred from entering, whereas you're never going to stumble upon the story of even 'Nasse hittar en stol' (Swedish for 'Nasse finds a chair'), a very short book for very small children.
Also, there are various incentives for teachers to publish books. Money is just one of them (I wonder how much revenue books bring to the teachers). Prestige and academic recognition is another. There are probably others still. How realistic is the depiction of a deprived teacher whose livelihood depended on the books he published once every several years?
Saying "they have the money" is not an argument. It's about the amount of effort needed to individually buy, scan, and process millions of pages. If that's already been done for you, why re-do it all?
I'm against Anthropic stealing teacher's work and discouraging them from ever writing again. Some teachers are already saying this (though probably not in California).
I think this is a fantasy. My father cowrote a Springer book about physics. For the effort, he got like $400 and 6 author copies.
Now, you might say he got a bad deal (or the book was bad), but I don't think hundreds of thousands of authors do significantly better. The reality is, people overwhelmingly write because they want to, not because of money.
Writing books is a profession.
Some people write full-time and make a living from it, through book sales, speaking gigs, teaching, or other related work.
Maybe ask Tim O’Reilly what he thinks about this so-called fantasy.
Like I said, Anthropic needs to stop stealing books or face the consequences.
My point is, the controversy is not an AI corporation vs 10^5 ordinary teachers. It's a battle of two corporations, or business models, if you will. But regardless of the result, most of the book authors will continue to get screwed, maybe the means will change. But it will not prevent them from writing, either. So I don't see any mass writers protests coming, sorry.
I also don't think Anthropic's AI is going to be any less intelligent if it didn't read any modern fiction books and instead read a Wikipedia summary. Stories and myths are a human way of understanding the world; machines probably don't need them. And for non-fiction books, there really aren't that many irreplaceable high-profile authors out there. If it can't read, say, Feynman's Lectures on Physics, it can learn the same from 100s of other physics textbooks. Maybe they are slightly worse organized, but why should a superintelligence care?
Training a generative model on a book is the mechanical equivalent of having a human read the book and learn from it. Is it stealing if a person reads the book and learns from it?
Downloading the book without paying for it, which is more or less what the judge said.
Photocopying books in their entirety for commercial use is absolutely illegal.
I do not know what a person is, but I've met many people dimwitted enough that they might just be a meat LLM. LLMs aren't artificial intelligence (or even anything close), but they definitely aren't "copy machines with a blender inside"... and suggesting that they are marks you one of the dimwits I just mentioned.
Copyright maximalists are enemies of mankind.
>Photocopying books in their entirety for commercial use is absolutely illegal.
Non sequitur. This isn't "photocopying books" or anything even slightly similar. Your arguments are disingenuous.
There is no equality, and seemingly there are worker bees who can be exploited, and there are privileged ones, and of course there are the queens.
And it just so happens that that belief says they can burn whatever they want down because something in the future might happen that absolves them of those crimes.
Note: My definition of singularity isn't the one they use in San Francisco. It's the moment founders who stole the life's work of thousands of teachers finally go to prison, and their datacentres get seized.
Luigi was peanuts in comparison.
“THERE were two “Reigns of Terror,” if we would but remember it and consider it; the one wrought murder in hot passion, the other in heartless cold blood; the one lasted mere months, the other had lasted a thousand years; the one inflicted death upon ten thousand persons, the other upon a hundred millions; but our shudders are all for the “horrors” of the minor Terror, the momentary Terror, so to speak; whereas, what is the horror of swift death by the axe, compared with lifelong death from hunger, cold, insult, cruelty, and heart-break? What is swift death by lightning compared with death by slow fire at the stake? A city cemetery could contain the coffins filled by that brief Terror which we have all been so diligently taught to shiver at and mourn over; but all France could hardly contain the coffins filled by that older and real Terror—that unspeakably bitter and awful Terror which none of us has been taught to see in its vastness or pity as it deserves.”
- Mark Twain
When it comes to a lot of these teachers, I'll say, copyright works hand in hand with college and school course-book mandates. I've seen plenty of teachers making crazy money off students' backs due to these mandates.
A lot of the content taught in undergrad and school hasn't changed in decades or even centuries. I think we have all the books we'll ever need in certain subjects already, but copyright keeps enriching people who write new versions of these.
Writers that have an authentic human voice and help people think about things in a new way will be fine for a while yet.
105B+ is more than Anthropic is worth on paper.
Of course they’re not going to be charged to the fullest extent of the law, they’re not a teenager running Napster in the early 2000s.
No, it's not.
It's the maximum statutory damages for willful infringement, which this has not been adjudicated to be. It is not a fine; it's an alternative basis of recovery to actual damages plus the infringer's profits attributable to the infringement.
Of course, there's also a very wide range of statutory damages; the minimum (if it is not "innocent" infringement) is $750/work.
> 105B+ is more than Anthropic is worth on paper.
The actual amount of 7 million works times $150,000/work is $1.05 trillion, not $105 billion.
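The arithmetic in this subthread is easy to check. A quick sketch using the statutory range mentioned above ($750/work minimum for non-innocent infringement, $150,000/work maximum for willful) and the 7 million-work figure from the thread; none of these numbers are adjudicated damages:

```python
# Back-of-the-envelope statutory-damages range for the figures in this thread.
WORKS = 7_000_000
MIN_PER_WORK = 750        # statutory minimum, non-innocent infringement
MAX_PER_WORK = 150_000    # statutory maximum, willful infringement

low = WORKS * MIN_PER_WORK    # $5.25 billion
high = WORKS * MAX_PER_WORK   # $1.05 trillion, not $105 billion

print(f"low:  ${low:,}")
print(f"high: ${high:,}")
```

This confirms the correction: 7 million works at $150,000 each is $1.05 trillion.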
Yeah, you’re probably right, I’m not a lawyer. The point is that it doesn’t matter what number the law says they should pay, Anthropic can afford real lawyers and will therefore only pay a pittance, if anything.
I’m old enough to remember what the feds did to Aaron Swartz, and I don’t see what Anthropic did that was so different, ethically speaking.
The trial is scheduled for December 2025. That's when a jury will decide how much Anthropic owes for copying and storing those pirated books.
Turns out this doesn't quite mitigate downloading them first. (Though frankly, I'm very much against people having to buy 7 million books when someone has already scanned them)
This is what every company using media is doing (think Spotify, Netflix, but also journals, ad agencies, ...). I don't know why people on HN are giving AI companies a pass for this kind of behavior.
As mentioned in The Fucking Article, there's a legal difference between training an AI which largely doesn't repeat things verbatim (ala Anthropic) and redistributing media as a whole (ala Spotify, Netflix, journal, ad agency).
Because, for example, if you buy a movie on disc, that's a personal license and you can watch it yourself at home. But you can't, like, play it at a large public venue that sells tickets to watch it. You need a different and more expensive license to make money off the usage of the content in a larger capacity like that.
$500,000 per infringement...
Anthropic faces billions of dollars in damages for pirating over 7 million books to build a digital library.
The trial is scheduled for December 2025. That’s when a jury will decide how much Anthropic owes for copying and storing over 7 million pirated books.
Did you read the article? The judge literally just legally recognized it.
Individuals would have their lives ruined either from massive fines or jail time.
We've held China accountable for counterfeiting products for decades and regulated their exports. So why should Anthropic be allowed to export their products and services after engaging in the same illegal activity?
We have? Are we from different multi-verses?
The one I've lived in to date has not done anything against Chinese counterfeits beyond occasionally seizing counterfeit goods during import. But that's merely occasionally enforcing local counterfeit law, a far cry from punishing the entity producing it.
As a matter of fact, companies started outsourcing everything to China, making further IP theft and quasi-copies even easier.
It's quite the mafia operation over at Amazon.
Whether or not the countermeasures have been effective in practice is a minor detail in the GP point that we would not expect an American company headquartered in the US and conducting significant business in the US to get away with the same thing.
You never noticed the hypocritical behavior all over society?
* You drive drunk: big fine, lots of trouble. You drive drunk and are a senator, cop, or mayor: well, let's look the other way.
* You have anger management issues and slam somebody to the ground: jail time. You're a cop with anger management issues who slams somebody to the ground: paid time off while we investigate, and maybe a reprimand. Qualified immunity, boy!
* You commit tax fraud for 10k: felony record, maybe jail time. You're an exec of a company who commits tax fraud for 100 million: after 10 years of lawyering around, maybe you get something, maybe... oh, here's a fine of 5 million.
I am sorry, but the idea of everybody being equal under the law has always been an illusion.
We are holding China accountable for counterfeiting products because it hurts OUR companies and their income. But when it's "us vs us", well, then it becomes a bit more messy, and in general those with the biggest backing (as in $$$, economic value, and lawyers) tend to win.
Wait: if somebody steals my book, I can sue that person in court and get a payout (lawyers will cost me more, but that is not the point). If some AI company steals my book, well, the chance you win is close to 1%, simply because lots of well-paid lawyers will make winning hard to impossible.
Our society has always been based upon power, wealth, and influence. The more of it you have, the more you get away with (or get reduced penalties for) things that get others fined or jailed.
The comment is more about the pseudo-ethical high ground.
> I honestly feel bad for some of these AI companies because the rules around copyright are changing just to target them
The ruling would be a huge win for AI companies if upheld. It's really weird that you reached the opposite conclusion.
Now, in theory, you learning from an author's works and competing with them in the same market could meaningfully deprive them of income, but it's a very difficult argument to prove.
On the other hand, with AI companies it's an easier argument to make. If Anthropic trained on all of your books (which is somewhat likely if you're a fairly popular author) and you saw a substantial loss of income after the release of one of their better models (presumably because people are just using the LLM to write their own stories rather than buy your stuff), then it's a little bit easier to connect the dots. A company used your works to build a machine that competes with you, which arguably violates the fair use principle.
Gets to the very principle of copyright, which is that you shouldn't have to compete against "yourself" because someone copied you.
This is one of those mental gymnastics exercises that makes copyright law so obtuse and effectively unenforceable.
As an alternative, imagine a scriptwriter buys a textbook on orbital mechanics, while writing Gravity (2013). A large number of people watch the finished film, and learn something about orbital mechanics, therefore not needing the textbook anymore, causing a loss of revenue for the textbook author. Should the author be entitled to a percentage of Gravity's profit?
We'd be better off abolishing everything related to copyright and IP law alltogether. These laws might've made sense back in the days of the printing press but they're just nonsensical nowadays.
I think, in your example, the obvious answer is no, they're not entitled to any profits of Gravity. How could you possibly prove Gravity has anything to do with someone reading, or not reading, a textbook? You can't.
However, AI participates in the exact same markets it trains from. That's obviously very different. It is INTENDED to DIRECTLY replace the things it trains on.
Meaning, not only does an LLM output directly replace the textbook it was trained on, but that behavior is the sole commercial goal of the company. That's why they're doing it, and that's the only reason they're doing it.
Maybe this is where I'm having trouble. You say "exact same markets" -- how is a print book the exact same market as a web/mobile text-generating human-emulating chat companion? If that holds, why can't I say a textbook is the exact same market as a film?
I could see the argument if someone published a product that was fine-tuned on a specific book, and marketed as "use this AI instead of buying this book!", but that's not the case with any of the current services on the market.
I'm not trying to be combative, just trying to understand.. they seem like very different markets to me.
Because the medium is actually the same. The content of a book is not paper, or a cover. It's text, and specifically the information in that text.
LLMs are intended to directly compete with and outright replace that use case. I don't need a textbook on, say, anatomy, because ChatGPT can structure and tell me about anatomy, and in fact with, say, the exact same content slightly re-arranged.
This doesn't really hold for fictional books, nor does it hold for movies.
Watching a movie and reading a book are inherently different experiences, which cannot replace one another. Reading a textbook and asking ChatGPT about topic X is, for all intents and purposes, the same experience. Especially since, remember, most textbooks are online today.
And, I think, the elephant in the room with these discussions: we cannot just compare ChatGPT to a human. That's not a foregone conclusion and, IMO, no, you can't just do that. You have to justify it.
Humans are special. Why? Because we are Human. Humans have different and additional rights which machines and programs do not have. If we want to extend our rights to machines, we can do that... but not for free. Oh no, you must justify that, and it's quite hard. Especially when said machines appear to work against Humans.
So here's the thing: I don't think a textbook author going up against a purveyor of online courseware has much of a chance, nor do I think it should have much of a chance, because it probably lacks meaningful proof that their works made a contribution to the creation of the courseware. Would I feel differently if the textbook author could prove in court that a substantial amount of their material contributed to the creation of the courseware, and when I say "prove" I mean they had receipts to prove it? I think that's where things get murky. If you can actually prove that your works made a meaningful contribution to the thing you're competing against, then maybe you have a point. The tricky part is defining meaningful. An individual author doesn't make a meaningful contribution to the training of an LLM, but a large number of popular and/or prolific authors can.
You bring up a good point, interpretation of fair use is difficult, but at the end of the day I really don't think we should abolish copyright and IP altogether. I think it's a good thing that creative professionals have some security in knowing that they have legal protections against having to "compete against themselves"
That's a point I normally use to argue against authors being entitled to royalties on LLM outputs. An individual author's marginal contribution to an LLM is essentially nil, and could be removed from the training set with no meaningful impact on the model. It's only the accumulation of a very large amount of works that turns into a capable LLM.
Paying everyone a flat rate per query is probably the only way you could do it; any other approach is either going to be contested as unfair in some way, or will be too costly to implement. But then, a flat rate is only fair if it covers everyone in proportion to the contribution, which will get diluted by the portion of training data that's not obviously attributable, like Internet comments or Wikipedia or public domain stuff or internally generated data, so I doubt authors would see any meaningful royalties from this anyway. The only thing it would do, is to make LLMs much more expensive for the society to use.
This argument would make sense if it was across the board, but it's impossible (and pretty ridiculous) to enforce in basically anything except very narrow types of media.
Let's say I come up with a groundbreaking workout routine. Some guy in the gym watches me for a while, adopts it, then goes on to become some sort of bodybuilding champion. I wouldn't be entitled to a portion of his winnings, that would be ridiculous.
Let's say I come up with a cool new fashion style. Someone sees my posts on insta and starts dressing similarly, then ends up with a massive following and starts making money in a modelling career. I wouldn't be entitled to a portion of their income, that would be ridiculous.
And yet, for some reason, media is special.
So what is the right interpretation of the law with regards to how AI is using it? What better incentivizes innovation? Do we let AI companies scan everything because AI is innovative? Or do we think letting AI vacuum up creative works to then stochastically regurgitate tiny (or not so tiny) slices of them at a time will hurt innovation elsewhere?
But obviously the real answer here is money. Copyright is powerful because monied interests want it to be. Now that copyright stands in the way of monied interests for perhaps the first time, we will see how dedicated we actually were to whatever justifications we've been seeing for DRM and copyright for the last several decades.
I really think we need to understand this as a society and also realize that moneyed interests will downplay this as much as possible. A lot of the problems we're having today are due to insufficient regulation differentiating between individuals and systems at scale.
What you're proposing is considering LLMs to be equal to humans when considering how original works are created. You could make the argument that LLM training data is no different from a human "training" themself over a lifetime of consuming content, but that's a philosophical argument that is at odds with our current legal understanding of copyright law.
> Alsup ruled that Anthropic's use of copyrighted books to train its AI models was "exceedingly transformative" and qualified as fair use, a legal doctrine that allows certain uses of copyrighted works without the copyright owner's permission.
Or even: is an individual operating within the law under fair use the same, in spirit, as a voracious, all-consuming AI training bot consuming everything?
Consider a single person in a National Park, allowed to pick and eat berries, compared to bringing a combine harvester to take it all.
Which of the following are true?
(a) the legal industry is susceptible to influence and corruption
(b) engineers don't understand how to legally interpret legal text
(c) AI tech is new, and judges aren't technically qualified to decide these scenarios
Most likely option is C, as we've seen this pattern many times before.
Any reasonable reading of the current state of fair use doctrine makes it obvious that the process between Harry Potter and the Sorcerer's Stone and "A computer program that outputs responses to user prompts about a variety of topics" is wildly transformative, and thus the usage of the copyrighted material is probably covered by fair use.
Where are you getting your data from? My conclusions are the exact opposite.
(Also, aren't judges by definition the only ones qualified to declare if it is actually fair use? You could make a case that it shouldn't be fair use, but that's different from it being not fair use.)
I think the overly liberal, non-tech crowd has become really vocal on HN as of late and your sample is likely biased by these people.
Just asking for a friend who's into this sort of thing.
Someone correct me if I am wrong but aren't these works being digitized and transformed in a way to make a profit off of the information that is included in these works?
It would be one thing for an individual to make personal use of one or more books, but you've got to have some special blindness not to see that a for-profit company's use of this information to improve a for-profit model clearly goes against what copyright stands for.
Although, there’s an exception for fictional characters:
https://en.m.wikipedia.org/wiki/Copyright_protection_for_fic...
Copyright isn't a digital moat. It's largely an agreement that the work is available to the public, but the creator has a limited amount of time to exploit it at market.
If you sell an AI model, or access to an AI model, there's usually around 0% of the training data redistributed with the model. You can't decompile it and find the book. As you aren't redistributing the original work, copyright is barely relevant.
Imagine suggesting that because you own the design of a hammer, all works created with the hammer belong to you and can't be sold.
That someone came up with a new method of using books as a tool to create a different work does not entitle the original book author to a cut of the pie.
Isn't that what a lot of companies are doing, just through employees? I read a lot of books, and took a lot of courses, and now a company is profiting off that information.
Simply, if the models can think then it is no different than a person reading many books and building something new from their learnings. Digitization is just memory. If the models cannot think then it is meaningless digital regurgitation and plagiarism, not to mention breach of copyright.
The quotes "consistent with copyright's purpose in enabling creativity and fostering scientific progress." and "Like any reader aspiring to be a writer" say, from what I can tell, that the judge has legally ruled the model can think as a human does, and therefore has the legal protections afforded to "creatives."
No, that's fallacious. Using anthropomorphic words to describe a machine does not give it the same kinds of rights and affordances we give real people.
> First, Authors argue that using works to train Claude’s underlying LLMs was like using works to train any person to read and write, so Authors should be able to exclude Anthropic from this use (Opp. 16). But Authors cannot rightly exclude anyone from using their works for training or learning as such. Everyone reads texts, too, then writes new texts. They may need to pay for getting their hands on a text in the first instance. But to make anyone pay specifically for the use of a book each time they read it, each time they recall it from memory, each time they later draw upon it when writing new things in new ways would be unthinkable. For centuries, we have read and re-read books. We have admired, memorized, and internalized their sweeping themes, their substantive points, and their stylistic solutions to recurring writing problems.
...
> In short, the purpose and character of using copyrighted works to train LLMs to generate new text was quintessentially transformative. Like any reader aspiring to be a writer, Anthropic’s LLMs trained upon works not to race ahead and replicate or supplant them — but to turn a hard corner and create something different. If this training process reasonably required making copies within the LLM or otherwise, those copies were engaged in a transformative use.
[1] https://authorsguild.org/app/uploads/2025/06/gov.uscourts.ca...

If a human uses a voting machine, they still have a right to vote.
Machines don't have rights. The human using the machine does.
It's like taking notes, or Google Image Search caching thumbnails. Honestly, we don't even need the learning metaphor to see this is obviously not an infringement.
No, you can't. The only thing you can learn with is your own mind (e.g. loading notes into your laptop is not you learning).
>loading notes into your laptop is not you learning
I don't want to get too distracted by this, but note taking is actually crucial to my learning things. I am largely a kinaesthetic learner, but when it comes to pure data retention, if I am not writing it out, it goes straight through. Note taking is crucial to my learning new things and retaining data, and I know I am not the only one.
Heck, it's common (or was common) for writers to completely rewrite, by hand, the books of "great" authors to try to learn their "voice".
Judges consider a four-factor test when examining fair use[1]. For search engines:
1) The use is transformative, as a tool to find content is very different purpose than the content itself.
2) The nature of the original work runs the full gamut, so search engines don't get points for only consuming factual data, but it was all publicly viewable by anyone, as opposed to books, which require payment.
3) Search engines store significant portions of the work in the index, but only redistribute small portions.
4) Search engines, as original devised, don't compete with the original, in fact they can improve potential market of the original by helping more people find them. This has changed over time though, and search engines are increasingly competing with the content they index, and intentionally trying to show the information that people want on the search page itself.
So traditional search which was transformative, only republished small amounts of the originals, and didn't compete with the originals fell firmly on the side of fair use.
Google News and Books on the other hand weren't so clear cut, as they were showing larger portions of the works and were competing with the originals. They had to make changes to those products as a result of lawsuits.
So now lets look at LLMs:
1) LLM are absolutely transformative. Generating new text at users request is a very different purpose and character from the original works.
2) Again runs the full gamut (setting aside the clear copyright infringement of downloading illegally distributed books, which is a separate issue).
3) For training purposes, LLMs don't typically preserve entire works, so the model is in a better place legally than a search index, which has precedent that storing entire works privately can be fair use depending on the other factors. For inference, even though LLMs are less likely to reproduce the originals in their outputs than search engines, there are failure cases where an LLM over-trained on a work and a significant amount of the original can be reproduced.
4) LLMs have tons of uses, some of which complement the original works and some of which compete directly with them. Because of this, it is likely that whether LLMs are fair use will depend on how they are being used - e.g., ignore the LLM altogether and consider solely the output and whether it would be infringing if a human had created it.
This case was solely about whether training on books is fair use, and did not consider any uses of the LLM. Because LLMs are a very transformative use, and because they don't store originals verbatim, it weighs strongly toward being fair use.
I think the real problems that LLMs face will be in factors 3 and 4, which is very much context specific. The judge himself said that the plaintiffs are free to file additional lawsuits if they believe the LLM outputs duplicate the original works.
[1] https://fairuse.stanford.edu/overview/fair-use/four-factors/
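The factor-by-factor comparison above can be tabulated. A sketch; the assessments are this comment's characterizations of search engines versus LLMs, not legal conclusions:

```python
# The four fair-use factors, with the comparison made in the comment above.
factors = {
    "1. purpose and character": {
        "search engine": "transformative: a tool to find content",
        "LLM": "transformative: generates new text at the user's request",
    },
    "2. nature of the original work": {
        "search engine": "full gamut, but publicly viewable pages",
        "LLM": "full gamut, including books that require payment",
    },
    "3. amount and substantiality used": {
        "search engine": "stores large portions, redistributes small ones",
        "LLM": "doesn't typically preserve entire works; over-training failure cases exist",
    },
    "4. effect on the market": {
        "search engine": "originally complementary, increasingly competing",
        "LLM": "depends on use: some complements, some direct competition",
    },
}

for factor, views in factors.items():
    print(factor)
    for actor, view in views.items():
        print(f"  {actor}: {view}")
```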
Learning from the book is, well, learning from the book. Yes, they intended to make money off of that learning... but then I guess a medical student reading medical textbooks intends to profit off of what they learn from them. Guess that's not fair use either (well, it's really just use, as in the intended use for all books since they were first invented).
Once a person has come to believe that copyright has any moral weight at all, I guess all rational thought becomes impossible for them. Somehow, they're not capable of entertaining the idea that copyright policy was only ever supposed to be a pragmatic thing to incentivize creative works, and that whatever little value it has disappears entirely once the policy is twisted to consolidate control.
Please, please differentiate between pirating books (which Anthropic is liable for, and is still illegal) and training on copyrighted material (which was found to be legal, for both corporations and average people).
The number of these artists I have seen receiving some bogus DMCA takedown notice for fan art is crazy.
I saw a bloke give away some of his STLs because he received a takedown request from Games Workshop and didn't have the funds to fight it.
It's not that I want small artists to lose; it's that I want them to gain access to every bloody copyright and trademark so they are more free to create.
Shit, Conde Nast managed to pull something like 400 pulps off the market so they didn't interfere with their newly launched James Patterson collaborations.
There are many small artists who do this not for money, but for fun, and have their renowned styles. Even their styles are ripped off by these generative AI companies and turned into a slot machine to earn money for themselves. These artists didn't consent to that, and this affects their (mental) well-being.
With that context in mind, what do you think about these people, who are not in this for money, having their years of achievement ripped off and their hard work exploited for money by generative AI companies?
It's not about IP (with whatever expansion you prefer) or laws, but ethics in general.
Substitute comics for any medium. Code, music, painting, illustration, literature, short movies, etc.
Yes, style copying is generally considered legal, but as another commenter posted in a related thread "scale matters".
Maybe this will be reconsidered in the near future, as the scale is on an entirely different level with generative AI. While there can be no technological solution to this (since it's a social problem to begin with), maybe public opinion about this issue will evolve over time.
To be crystal clear: I'm not against the tech. I'm against abusing and exploiting people for solely monetary profit.
(2) Once you make something publicly available, anyone can learn from it. No consent necessary.
(3) Being upset does not grant you special privileges under the law.
(4) If you don't like the idea of paying for AI art, free software is both plentiful and competitive with just about anything proprietary.
(2) Learning and copying are different. If we're going to enter the "but AI learns just like humans" realm, let's please not, because ML tries to mimic human learning; they are not the same, even remotely.
(3) I don't think I said laws behave differently because "some humans are upset".
(4) I haven't told anything about paying for (AI) art. I don't pay for AI art, but I pay for "real" art, made by humans. I also make art sometimes, and don't use AI/ML while doing that, and don't feed models with my art (I stopped sharing on multiple platforms because of that).
(5) If we presume that every law is "correct, right and forever", then we have another name for it. It's called dogma. Moreover, we have other "unwritten" rules which generally support or sometimes supersede the laws in place, and the word for that is "ethics".
(6) Staying on the subject of law, as somebody else noted [0], scale matters. While doing style transfer for a couple of things might be considered legal, that needs reconsideration when scale reaches the levels possible with today's generative AI.
(2) "Copying", as in verbatim using/memorizing sizeable chunks of content and suchlike? No, the goal of ML training is to generalize across concepts. I didn't make an argument based on some direct connection to neuroscience, but I also reject the idea that there isn't one.
(3) So it's an irrelevant point. Some people are upset about technology? I honestly don't care.
(4) You said that the big bad corporations are making money. I'm saying that FOSS alternatives are all over the place.
(5) That so? My code of ethics says that Luddites getting steamrolled is an immense moral good.
(6) See (1).
> My code of ethics says that Luddites getting steamrolled is an immense moral good.
Looks like it's not possible for us to discuss this further. Not because of your views, but how rigid you are on your beliefs and perspective.
With no hard feelings, I wish that time shows that you're right, all the time.
Have a nice day. :)
I do support intellectual property reform that would be considered radical by some, as I imagine you do. But my highest hopes for this situation are more modest: if AI companies are told that their data must be in the public domain to train against, we will finally have a powerful faction among capitalists with a strong incentive to push back against the copyright monopolists when it comes to the continuous renewal of copyright terms.
If the "path of least resistance" for companies like Google, Microsoft, and Meta becomes enlarging the public domain, we might finally begin to address the stagnation of the public domain, and that could be a good thing.
But I think even such a modest hope as that one is unlikely to be realized. :-\
My response to this whole thread is just “good”
Aaron Swartz is a saint and a martyr.
"Anthropic cut up millions of used books to train Claude — and downloaded over 7 million pirated ones too, a judge said."
A not-so-subtle difference.
That said, in a sane world, they shouldn't have needed to cut up all those used books yet again when there's obviously already an existing file that does all the work.
Or is it perhaps not a universal cultural/moral aspect?
I guess for example in Europe people could be more sensitive to it.
That said, there are tools for digitizing books that don't require destroying them.
Man of Two Worlds by Brian Herbert.
...and I did the world a favor.
https://ia800101.us.archive.org/15/items/gov.uscourts.cand.4...
Also, please don't use the word "learning"; use "creating software using copyrighted materials".
Also, let's think together about how we can prevent AI companies from using our work via technical measures, if the law doesn't work.
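One commonly suggested technical measure is a robots.txt opt-out. The user-agent tokens below are the ones the major crawlers publish (GPTBot for OpenAI, ClaudeBot for Anthropic, CCBot for Common Crawl, Google-Extended for Google's AI training); a sketch, assuming you control the site's web root:

```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Of course this is honor-system only: it does nothing against a crawler that ignores robots.txt, which is part of the point being made here.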
People already use pirated software for product creation.
Hypothetical:
I know a guy who learned photoshop on a pirated copy of Photoshop. He went on to be a graphic designer. All his earnings are ‘proceeds from crime’
He never used the pirated software to produce content.
If you watch a YouTube video to learn something and it's later taken down for using copyrighted images, you learned from illegal content.
I could, right now in just a few minutes, go download a perfectly functional pirated copy of nearly any Adobe program, nearly any Microsoft program and a whole range of books and movies, yet I see zero real financial troubles affecting any of the companies behind these. All the contrary in fact.
You are allowed to buy and scan books, and then use those scanned books to create products. I guess you are also allowed to pirate books and use the knowledge to create products, if you are willing to pay the damages to the rights holders for copyright violations.
The whole point of copyright is to ensure you're paid for your work. AI companies shouldn't pirate, but if they pay for your work, they should be able to use it however they please, including training an LLM on it.
If that LLM reproduces your work, then the AI company is violating copyright, but if the LLM doesn't reproduce your work, then you have not been harmed. Trying to claim harm when you haven't been due to some philosophical difference in opinion with the AI company is an abuse of the courts.
People don't view moral issues in the abstract.
A better perspective on this is the fact that human individuals have created works which megacorps are training on for free or for the price of a single book and creating models which replace individuals.
The megacorps are only partially replacing individuals now, but when the models get good enough they could replace humans entirely.
When such a future happens will you still be siding with them or with individual creators?
Those damn kind readers and libraries. Giving their single copy away when they just paid for the single.
No. The point of copyright is that the author gets to decide under what terms their works are copied. That's the essence of copyright. In many cases, authors will happily sell you a copy of their work, but they're under no obligation to do so. They can claim a copyright and then never release their work to the general public. That's perfectly within their rights, and they can sue to stop anybody from distributing copies.
If the author didn't want their work to be included in an LLM, they should not have sold it, just like if an author didn't want their work to inspire someone else's work, they should not have sold it.
If that were the case then this court case would not be ongoing
I could agree with exceptions for non-commercial activity like scientific research, but AI companies are made for extracting profits and not for doing research.
> AI companies shouldn't pirate, but if they pay for your work, they should be able to use it however they please, including training an LLM on it.
It doesn't work this way. If you buy a movie it doesn't mean you can sell goods with movie characters.
> then you have not been harmed.
I am harmed because fewer people will buy the book if they can simply get an answer from an LLM. Fewer people will hire me to write code if an LLM trained on my code can do it. Maybe instead of books we should start making applications that protect the content and do not allow copying text or making screenshots. And instead of open-source code we should provide binary WASM modules.
And the harm you describe is not a recognized harm. You don't own information, you own creative works in their entirety. If your work is simply a reference, then the fact being referenced isn't something you own, thus you are not harmed if that fact is shared elsewhere.
It is an abuse of the courts to attempt to prevent people who have purchased your works from using those works to train an LLM. It's morally wrong.
Sorry for the long quote, but basically this, yeah. A major point of free software is that creators should not have the power to impose arbitrary limits on the users of their works. It is unethical.
It's why the GPL allows the user to disregard any additional conditions, why it's viral, and why the FSF spends so much effort on fighting "open source but..." licenses.
When you do it for a transformative purpose (turning it into an LLM model) it's certainly fair use.
But more importantly, it's ethical to do so, as the agreement you've made with the person you've purchased the book from included permission to do exactly that.
Reasonable minds could debate the ethics of how the material was used, but this ruling judged the usage legal and fair use. The only problem is that the material was, in effect, stolen.
If I didn’t license all the books I trained on, am I not depriving the publisher of revenue, given people will pay me for the AI instead of buying the book?
As the judge noted in this ruling, copyright isn’t intended to protect authors from competition. Copyright doesn’t protect Rowling from other authors writing YA wizard books cutting into her revenue streams. Or from TV producers making YA wizard shows that reduce the demand for books. Copyright doesn’t protect the Tolkien estate from Terry Brooks, or Tracy Hickman or Margret Weiss reducing the demand for Tolkien fantasy by supplanting it with their own fantasies.
> That doesn't seem to chime with the copyright notices I have read in books.
You shouldn't get your legal advice from someone with skin in the game.
I used to get scared by such verbiage. Courts ruled decades ago that many of those uses are actually permitted, under very common conditions (e.g. not distributing, etc). Yes, you totally can photocopy a book you own, for your own purposes.
But this analogy seems wrong. First, an LLM is not a human and cannot "learn" or "train"; only a human can do that. And LLM developers are not aspiring to become writers and do not learn anything; they just want to profit by making software using copyrighted material. Also, people do not read millions of books to become a writer.
The analogy refers to humans using machines to do what would already be legal if they did it manually.
> And LLM developers are not aspiring to become writers and do not learn anything, they just want to profit by making software using copyrighted material.
[Citation needed], and not a legal argument.
> Also people do not read millions of books to become a writer.
But people do hear millions of words as children.
At a rate of 1,000 words/day, it takes about three years to hear a million words. Also, "a million words" is not equal to "a million books". Humans are ridiculously efficient at learning compared to LLMs.
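To put rough numbers on that comparison (the 1,000 words/day rate is the figure above; the ~10-trillion-token corpus size is a loose assumption about frontier-model training sets, not a published figure):

```python
# Back-of-the-envelope: a child hearing ~1,000 words/day
# vs. an LLM trained on an assumed ~10 trillion tokens.
words_per_day = 1_000
days_to_million = 1_000_000 // words_per_day   # 1,000 days
years_to_million = days_to_million / 365       # ~2.7 years

llm_tokens = 10_000_000_000_000                # assumed corpus size
ratio = llm_tokens // 1_000_000                # 10,000,000x a million words

print(f"{years_to_million:.1f} years to hear a million words")
print(f"assumed LLM corpus is ~{ratio:,}x a million words")
```

Even treating a word and a token as roughly equivalent, the data gap is about seven orders of magnitude.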
as long as you buy the book it still should be legal, that is if you actually buy the book and not a "read only" eBook
but the 7,000,000 pirated books are a huge issue, and one which we have a lot of reason to believe isn't specific to just Anthropic
2020's: (Steals a bunch of books to profit off acquired knowledge.)
You are often allowed to make a digital copy of a physical work you bought. There are tons of used, physical works that would be good for training LLMs. They'd also be good for training OCR, which could do many things, including improve book scanning for training.
This could be reduced to a single act of book destruction per copyrighted work or made unnecessary if copyright law allowed us to share others' works digitally with their licensed customers. Ex: people who own a physical copy or a license to one. Obviously, the implementation could get complex but we wouldn't have to destroy books very often.
From there, the cases would likely focus on whether that fits in established criteria for digitized copies, whether they're allowed in the training process itself, and the copyright status of the resulting model. Some countries allow all of that if you legally obtained the material in the first place. Also, they might factor whether it's for commercial use or not.
Fake altruistic mindset. Super sociopathic.
And "And only the owner can make more books" is trivially false
Keep crying about this nothingburger
"Anthropic had no entitlement to use pirated copies for its central library...Creating a permanent, general-purpose library was not itself a fair use excusing Anthropic's piracy." --- the ruling
If they committed piracy 7 million times and the minimum statutory damages for each instance are $750, then the law says that Anthropic is liable for at least $5.25 billion. I just want it out there that they definitely broke the law and the penalty is a minimum of $5.25 billion in fines according to the law, so that when none of this actually happens we at least can't pretend we didn't know.
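For scale, the per-work statutory ranges in 17 U.S.C. § 504(c) ($750 minimum, $30,000 ordinary maximum, $150,000 if willful) can be multiplied out against the 7 million works reported in the ruling:

```python
# Statutory damages math: per-work figures from 17 U.S.C. § 504(c),
# multiplied by the 7 million pirated works cited in the ruling.
works = 7_000_000

minimum = works * 750         # statutory floor per work
ordinary_max = works * 30_000  # ordinary statutory ceiling per work
willful_max = works * 150_000  # ceiling if infringement is willful

print(f"minimum:      ${minimum:,}")       # $5,250,000,000
print(f"ordinary max: ${ordinary_max:,}")  # $210,000,000,000
print(f"willful max:  ${willful_max:,}")   # $1,050,000,000,000
```

So even the statutory floor exceeds $5B, and a willful-infringement finding could in theory reach into the trillions, which is why the damages phase matters so much more than the fair-use ruling itself.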
Can anyone make a compelling argument that any of these AI companies have the public's best interest in mind (alignment/superalignment)?
> Alsup detailed Anthropic's training process with books: The OpenAI rival spent "many millions of dollars" buying used print books, which the company or its vendors then stripped of their bindings, cut the pages, and scanned into digital files.
I've noticed an increase in used book prices in the recent past and now wonder if there is an LLM effect in the market. Not only that, but all of us are guilty too, because I'm positive we've all clicked on search results that contained copyrighted content that was copied without permission. You may not have even known it was such.
Remember: Intent is irrelevant when it comes to copyright infringement! It's not that kind of law.
Intent can guide a judge when they determine damages but that's about it.
I can read 100 books and write a book based on the inspiration I got from the 100 books without any issue. However, if I pirate the 100 books I've still committed copyright infringement despite my new book being fully legal/fair use.
Ensure the models are open source, so everyone can use them, as everyones data is in there?
Close those companies and force them to delete the models, as they used copyright material?
Right guys we don't have rules for thee but not for me in the land of the free?
Some previous discussions:
https://news.ycombinator.com/item?id=44367850
Or we could recognize that’s silly when we’re talking about a group of people acting in concert and treat them as a single entity for the purpose of alleged crimes. Which is what we do when we treat a corporation as an individual for legal purposes.
This of course cannot be allowed to happen, so the legal system is just a limbo: a bar which regular individuals must strain to pass under but that corporations regularly overstep.
As a researcher I've been furious that we publish papers where the research data is unknown. To add insult to injury, we have the audacity to start making claims about "zero-shot", "low-shot", "OOD", and other such things. It is utterly laughable. These would be tough claims to make *even if we knew the data*, simply because of its size. But not knowing the data, it is outlandish. Especially because the presumption is "everything on the internet." It would be like training on all of GitHub and then writing your own simple programming questions to test an LLM[0]. Analyzing that amount of data is just intractable, and we currently do not have the mathematical tools to do so. But this is a much harder problem to crack when we're just conjecturing, and ultimately this makes interpretability more difficult.
On top of all of that, we've been playing this weird legal game, where it seems that every company has had to cheat. I can understand how smaller companies turn to torrenting to compete, but when it is big names like Meta, Google, Nvidia, OpenAI (Microsoft), etc., it is just wild. This isn't even following the highly controversial advice of Eric Schmidt, "Steal everything, then if you get big, let the lawyers figure it out."[1] This is just "steal everything, even if you could pay for it." We're talking about the richest companies in the entire world. Some of the, if not the, richest companies to ever exist.[2]
Look, can't we just try to be a little ethical? There is, in fact, enough money to go around. We've seen unprecedented growth in the last few years. It was only 2018 when Apple became the first trillion dollar company, 2020 when it became the first two trillion dollar company, and 2022 when it became the first three trillion dollar company. Now we have 10 companies north of the trillion dollar mark![3] (5 above $2T and 3 above $3T) These valuations have exploded in the last 5 years! It feels difficult to say that we don't have enough money to do things better, to at least not completely screw over "the little guy." I am unconvinced that these companies would be hindered if they had to broker some deal for training data. Hell, they're already going to war over data access.
My point here is that these two things align. We're talking about how this technology is so dangerous (every single one of those CEOs has made that statement) and yet we can't remain remotely ethical? How can you shout "ONLY I CAN MAKE SAFE AI" while acting so unethically? There's always moral gray areas but is this really one of them? I even say this as someone who has torrented books myself![4] We are holding back the data needed to make AI safe and interpretable while handing the keys to those who actively demonstrate that they should not hold the power. I don't understand why this is even that controversial.
[0] Yes, this is a snipe at HumanEval. Yes, I will make the strong claim that the dataset was spoiled from day 1. If you doubt it, go read the paper and look at the questions (HuggingFace).
[1] https://www.theverge.com/2024/8/14/24220658/google-eric-schm...
[2] https://en.wikipedia.org/wiki/List_of_public_corporations_by...
[3] https://companiesmarketcap.com/
[4] I can agree it is wrong, but can we agree there is a big difference between a student torrenting a book and a billion/trillion dollar company torrenting millions of books? I even lean on the side of free access to information, and am a fan of Aaron Swartz and SciHub. I make all my works available on ArXiv. But we can recognize there's a big difference between a singular person doing this at a small scale and a huge multi-national conglomerate doing it at a large scale. I can't even believe we so frequently compare these actions!
> In fact this business was the ultimate in deconstruction: First one and then the other would pull books off the racks and toss them into the shredder's maw. The maintenance labels made calm phrases of the horror: The raging maw was a "NaviCloud custom debinder." The fabric tunnel that stretched out behind it was a "camera tunnel...." The shredded fragments of books and magazine flew down the tunnel like leaves in a tornado, twisting and tumbling. The inside of the fabric was stitched with thousands of tiny cameras. The shreds were being photographed again and again, from every angle and orientation, till finally the torn leaves dropped into a bin just in front of Robert. Rescued data. BRRRRAP! The monster advanced another foot into the stacks, leaving another foot of empty shelves behind it.
Against companies like Elsevier locking up the worlds knowledge.
Authors are no different to scientists, many had government funding at one point, and it's the publishing companies that got most of the sales.
You can disagree and think Aaron Swartz was evil, but you can't have both.
You can take what Anthropic have shown is possible and do this yourself now.
isohunt: freedom of information
If I was China I would buy every lawyer to drown western AI companies in lawsuits, because it's an easy way to win AI race.
The difference is, Aaron Swartz wasn't planning to build massive datacenters with expensive Nvidia servers all over the world.
This was the result of a cruel and zealous overreach by the prosecutor to try to advance her political career. It should never have gone that far.
The failure of MIT to rally in support of Aaron will never be forgiven.
Lesson is simple. If you want to break a law make sure it is very profitable because then you can find investors and get away with it. If you play robin hood you will be met with a hammer.
I.e., this is not a big deal. The only difference now is that people are rapidly frothing to be outraged by the mere sniff of new tech on the horizon. Overton window in effect.