OpenAI now tries to hide that ChatGPT was trained on copyrighted books, including J.K. Rowling's Harry Potter series

L4sBot@lemmy.world · 3 years ago

Uriel238 [all pronouns]@lemmy.blahaj.zone · edit-2 3 years ago

Training AI on copyrighted material is no more illegal or unethical than training human beings on copyrighted material (from library books or borrowed books, no less!). And trying to challenge the veracity of generative AI systems on the notion that they were trained on copyrighted material only raises the specter that IP law has lost its validity as a public good.

The only valid concern about generative AI is that it could displace human workers (or swap out skilled jobs for menial ones) which is a problem because our society recognizes the value of human beings only in their capacity to provide a compensation-worthy service to people with money.

The problem is this is a shitty, unethical way to determine who gets to survive and who doesn’t. All the current controversy about generative AI does is kick this can down the road a bit. But we’re going to have to address soon that our monied elites will be glad to dispose of the rest of us as soon as they can.

Also, amateur creators are as good as professionals, given the same resources. Maybe we should look at creating content by other means than for-profit companies.

rosenjcb@lemmy.world · edit-2 3 years ago

The powers that be have done a great job convincing the layperson that copyright is about protecting artists and not publishers. It’s historically inaccurate and you can discover that copyright law was pushed by publishers who did not want authors keeping second hand manuscripts of works they sold to publishing companies.

Additional reading: https://en.m.wikipedia.org/wiki/Statute_of_Anne

Skanky@lemmy.world · 3 years ago

Vanilla Ice had it right all along. Nobody gives a shit about copyright until big money is involved.

Cyfuture AI@lemmy.world · 10 months ago

OpenAI has stated that its models were trained on publicly available and licensed data. There is no confirmed evidence that ChatGPT was specifically trained on copyrighted books like J.K. Rowling’s Harry Potter series. The company has not disclosed the full details of its training data.

Default_Defect@midwest.social · 3 years ago

They made it read Harry Potter? No wonder it’s gonna kill us all one day.

Blapoo@lemmy.ml · 3 years ago

We have to distinguish between LLMs

- Trained on copyrighted material, and
- Outputting copyrighted material

They are not one and the same

TwilightVulpine@lemmy.world · 3 years ago

Should we distinguish it though? Why shouldn’t (and didn’t) artists have a say if their art is used to train LLMs? Just like publicly displayed art doesn’t provide permission to copy it and use it for other unspecified purposes, it would be reasonable for the same to apply to AI training.

Blapoo@lemmy.ml · 3 years ago

Ah, but that’s the thing. Training isn’t copying. It’s pattern recognition. If you train a model on “The dog says woof” and then ask the model “What does the dog say”, it’s not guaranteed to say “woof”.

Similarly, just because a model was trained on Harry Potter, all that means is it has a good corpus of how the sentences in that book go.

Thus the distinction. Can I train on a comment section discussing the book?
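
The “dog says woof” point can be illustrated with a toy model. This is only a minimal sketch, assuming a simple bigram (word-pair) counter as a stand-in; real LLMs are neural networks and differ a lot in detail. The names `train_bigram` and `sample_next` and the three-sentence corpus are made up for illustration. What it shows: the trained model holds follower counts rather than the sentences themselves, and sampling the word after “says” is not guaranteed to return “woof”.

```python
import random
from collections import defaultdict, Counter

def train_bigram(sentences):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts

def sample_next(counts, word, rng):
    """Sample a follower of `word` in proportion to how often it was seen."""
    followers = counts[word.lower()]
    choices = list(followers.keys())
    weights = list(followers.values())
    return rng.choices(choices, weights=weights, k=1)[0]

# Hypothetical toy corpus (an assumption for illustration only).
corpus = [
    "the dog says woof",
    "the dog says hello",
    "the cat says meow",
]
model = train_bigram(corpus)

# The model keeps statistics about word pairs, not the original sentences.
print(dict(model["says"]))

# Each sample is one of the followers seen after "says", chosen at random.
rng = random.Random(0)
samples = [sample_next(model, "says", rng) for _ in range(20)]
print(samples)
```

Note the flip side: if the only training sentence were “the dog says woof”, this toy model would always answer “woof”, so how much variety there is depends on what else is in the corpus.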

Even_Adder@lemmy.dbzer0.com · 3 years ago

Yeah, this headline is trying to make it seem like training on copyrighted material is or should be wrong.

scv@discuss.online · 3 years ago

Legally the output of the training could be considered a derived work. We treat brains differently here, that’s all.

I think the current intellectual property system makes no sense and AI is revealing that fact.

TropicalDingdong@lemmy.world · 3 years ago

I think this brings up broader questions about the currently quite extreme interpretation of copyright. Personally I don’t think it’s wrong to sample from or create derivative works from something that is accessible. If it’s not behind lock and key, it’s free to use. If you have a problem with that, then put it behind lock and key. No one is forcing you to share your art with the world.

Tetsuo@jlai.lu · 3 years ago

Output from an AI has just been recently considered as not copyrightable.

I think it stemmed from the actors’ strikes recently.

It was stated that only work originating from a human can be copyrighted.

Anders429@lemmy.world · 3 years ago

Output from an AI has just been recently considered as not copyrightable.

Where can I read more about this? I’ve seen it mentioned a few times, but never with any links.

Even_Adder@lemmy.dbzer0.com · 3 years ago

They clearly only read the headline. If they’re talking about the ruling that came out this week, that whole thing was about trying to give an AI authorship of a work generated solely by a machine and having the copyright go to the owner of the machine through the work-for-hire doctrine. So an AI itself can’t be an author or hold a copyright, but humans using one can still be copyright holders of any qualifying works.

fubo@lemmy.world · 3 years ago

If I memorize the text of Harry Potter, my brain does not thereby become a copyright infringement.

A copyright infringement only occurs if I then reproduce that text, e.g. by writing it down or reciting it in a public performance.

Training an LLM from a corpus that includes a piece of copyrighted material does not necessarily produce a work that is legally a derivative work of that copyrighted material. The copyright status of that LLM’s “brain” has not yet been adjudicated by any court anywhere.

If the developers have taken steps to ensure that the LLM cannot recite copyrighted material, that should count in their favor, not against them. Calling it “hiding” is backwards.

Gyoza Power@discuss.tchncs.de · 3 years ago

Let’s not pretend that LLMs are like people where you’d read a bunch of books and draw inspiration from them. An LLM does not think nor does it have an actual creative process like we do. It should still be a breach of copyright.

efstajas@lemmy.world · 3 years ago

… you’re getting into philosophical territory here. The plain fact is that LLMs generate cohesive text that is original and doesn’t occur in their training sets, and it’s very hard if not impossible to get them to quote back copyrighted source material to you verbatim. Whether you want to call that “creativity” or not is up to you, but it certainly seems to disqualify the notion that LLMs commit copyright infringement.

Eccitaze@yiffit.net · 3 years ago

If Google took samples from millions of different songs that were under copyright and created a website that allowed users to mix them together into new songs, they would be sued into oblivion before you could say “unauthorized reproduction.”

You simply cannot compare one single person memorizing a book to corporations feeding literally millions of pieces of copyrighted material into a blender and acting like the resulting sausage is fine because “only a few rats fell into the vat, what’s the big deal”

jadegear@lemm.ee · 3 years ago

Terrible analogy.

AlexisLuna@lemmy.blahaj.zone · 3 years ago

Which one? And why exactly?

player2@lemmy.dbzer0.com · 3 years ago

The analogy talks about mixing samples of music together to make new music, but that’s not what is happening in real life.

The computers learn human language from the source material, but they are not referencing the source material when creating responses. They create new, original responses which do not appear in any of the source material.

afraid_of_zombies@lemmy.world · 3 years ago

I am sure they have patched it by now, but at one point I was able to get ChatGPT to give me copyrighted text from books by asking for ever larger quotations. It seemed more willing to do this with books that were out of print.

RadialMonster@lemmy.world · 3 years ago

What if they scraped a whole lot of the internet, and those excerpts were in random blogs and posts and quotes and memes etc. all over the place? They didn’t ingest the material directly, or knowingly.

beetus@sh.itjust.works · 3 years ago

Not knowing something is a crime doesn’t stop you from being prosecuted for committing it.

It doesn’t matter if someone else is sharing copyrighted works and you don’t know it: if you use them in ways that infringe on that copyright, you are still infringing.

“I didn’t know that was copyrighted” is not a valid defence.

chemical_cutthroat@lemmy.world · 3 years ago

That’s why this whole argument is worthless, and why I think that, at its core, it is disingenuous. I would be willing to bet a steak dinner that a lot of these lawsuits are just fishing for money, and the rest are set up by competitors trying to slow the market down because they are lagging behind. AI is an arms race, and it’s growing so fast that if you got in too late, you are just out of luck. So, companies that want in are trying to slow down the leaders, at best, and at worst they are trying to make them publish their training material so they can just copy it. AI training models should be considered IP, and should be protected as such. It’s like trying to get the Colonel’s secret recipe by saying that all the spices that were used have been used in other recipes before, so it should be fair game.

uzay@infosec.pub · 3 years ago

I hope OpenAI and JK Rowling take each other down

Touching_Grass@lemmy.world · 3 years ago

What’s the issue against openAI?

Corkyskog@sh.itjust.works · 3 years ago

They used to be a nonprofit that immediately turned into a for-profit once their product was refined. They took a bunch of people’s effort, whether training materials or the work of the “training monkeys” using the product, and then slapped a huge price tag on it.

BURN@lemmy.world · 3 years ago

They’re stealing a ridiculous amount of copyrighted works to use to train their model without the consent of the copyright holders.

This includes the single person operations creating art that’s being used to feed the models that will take their jobs.

OpenAI should not be allowed to train on copyrighted material without paying a licensing fee at minimum.

uzay@infosec.pub · 3 years ago

Also Sam Altman is a grifter who gives people in need small amounts of monopoly money to get their biometric data

LifeInMultipleChoice@lemmy.ml · 3 years ago

So, a hypothetical here. If Reddit did launch a system that let users trade karma in for real currency or some alternative, would all fan fiction and other fan-account-created material become copyright infringement, since the users would now be making money off the original works?

Touching_Grass@lemmy.world · 3 years ago

If they purchased the data or the data is free, it’s theirs to do what they want with, as long as they don’t violate the copyright by, say, reselling the original work as their own. Training off it should not violate any copyright if the work was available for free or purchased by at least one person involved. Capitalism should work both ways.

BURN@lemmy.world · 3 years ago

But they don’t purchase the data. That’s the whole problem.

And copyright is absolutely violated by training off it. It’s being used to make money and no longer falls under even the widest interpretation of fair use.

Touching_Grass@lemmy.world · 3 years ago

How do they get the data if it’s not purchased or freely available?

BURN@lemmy.world · 3 years ago

It may be freely available for non-commercial use, e.g. photos on Photobucket, the Internet Archive’s free book archives, etc.

Most everything is on the internet these days, copyrighted or not. I’m sure if I googled enough I could find the entire text of Harry Potter for free. I still haven’t purchased it, and technically it’s not legally freely available. But in training these models I guarantee they didn’t care where the data came from, just that it was data.

I’m against piracy as well for the record, but pretty much everything is available through torrenting and pirate sites at this point, copyright be damned.

Touching_Grass@lemmy.world · edit-2 3 years ago

Don’t care. It’s not my problem or these LLMs’ problem that they don’t secure their copyright. They shouldn’t come asking others to pay for them not securing their data. I see it as a double-edged sword.

I really hope this is a wake up call to all creative types to pack up and not use the internet like a street corner while they busk.

If they want to come online to contribute like everybody else, just have fun and post stuff, that’s great. But all of them are no different than any other greedy corporation. They all want more toll roads. When they do make it and earn millions and get our attention, they exploit it with more ads. It swallows all the free good content. Sites gear towards these rich creators. They lawyer up and sue everybody and everything that looks or sounds like them. We lose all our good spaces to them.

I hope the LLM allows regular people to shit post in peace finally.

GroggyGuava@lemmy.world · 3 years ago

You need to expand on how learning from something to make money is somehow using the original material to make money. Considering that’s how art works in general, I’m having a hard time taking the side of “learning from media to make your own is against copyright”. As long as they don’t reproduce the same thing as the original, I don’t see any issues with it. If they learned from Lord of the Rings to then make “The Lord of the Rings”, then yes, that’d be infringement. But if they use that data to make a new IP with original ideas, then how is that bad for the world or for artists?

BURN@lemmy.world · 3 years ago

Creating an AI model is a commercial work. They’re made to make money. Now these models are dependent on other artists’ data to train on. The models would be useless if they weren’t able to train on anything.

I hold the stance that using copyrighted data as part of a training set is a violation of copyright. That still hasn’t been fully challenged in court, so there’s no specific legal definition yet.

Because copyrighted materials are required to make the model function, I feel that they are using copyrighted works in order to build a commercial product.

Also AI doesn’t learn. LLMs build statistical models based on sentence structure of what they’ve seen before. There’s no level of understanding or inherent knowledge, and there’s nothing new being added.

ClamDrinker@lemmy.world · 3 years ago

This is just OpenAI covering their ass by attempting to block the most egregious and obvious outputs in legal gray areas, something they’ve been doing for a while, hence why their AI models are known to be massively censored. I wouldn’t call that ‘hiding’. It’s kind of hard to hide it was trained on copyrighted material, since that’s common knowledge, really.

paraphrand@lemmy.world · 3 years ago

Why are people defending a massive corporation that admits it is attempting to create something that will give them unparalleled power if they are successful?

Cosmic Cleric@lemmy.world · 3 years ago

Because ultimately, it’s about the truth of things, and not what team is winning or losing.

Whimsical@lemmy.world · 3 years ago

The dream would be that they manage to make their own glorious free & open source version, so that after a brief spike in corporate profit as they fire all their writers and artists, suddenly nobody needs those corps anymore because EVERYONE gets access to the same tools - if everyone has the ability to churn out massive content without hiring anyone, that theoretically favors those who never had the capital to hire people to begin with, far more than those who did the hiring.

Of course, this stance doesn’t really have an answer for any of the other problems involved in the tech, not the least of which is that there’s bigger issues at play than just “content”.

bamboo@lemm.ee · 3 years ago

Mostly because fuck corporations trying to milk their copyright. I have no particular love for OpenAI (though I do like their product), but I do have great disdain for already-successful corporations that would hold back the progress of humanity because they didn’t get paid (again).

assassin_aragorn@lemmy.world · 3 years ago

There’s a massive difference though between corporations milking copyright and authors/musicians/artists wanting their copyright respected. All I see here is a corporation milking copyrighted works by creative individuals.

msage@programming.dev · 3 years ago

But OpenAI will do the same?

uis@lemmy.world · 3 years ago

It’s like the argument “but new politicians will steal more” that I hear in Russia from people who defend Putin.

msage@programming.dev · 3 years ago

It’s literally not, wtf.

Do not let any private entity get an overwhelming majority on anything, period.

But do not kid yourself that Microsoft will let OpenAI do anything for the public once it gets big enough.

OpenAI is open only in name after they rolled back all the promises of being for everyone.

uis@lemmy.world · 3 years ago

That’s my entire point. It’s not who, but how long.

Also, Microsoft plays both sides here. OpenAI vs. copyright is the wrong question. There’s more: both are the status quo. Both are for keeping corporate ownership of ideas.

Crozekiel@lemmy.zip · 3 years ago

AI is the new fan boy following since it became official that nfts are all fucking scams. They need a new technological God to push to feel superior to everyone else…

SCB@lemmy.world · 3 years ago

Leftists hating on AI while dreaming of post-scarcity will never not be funny

Technoguyfication@lemmy.ml · 3 years ago

People are acting like ChatGPT is storing the entire Harry Potter series in its neural net somewhere. It’s not storing or reproducing text in a 1:1 manner from the original material. Certain material, like very popular books, has likely been interpreted tens of thousands of times due to how many times it was reposted online (and therefore how many times it appeared in the training data).

Just because it can recite certain passages almost perfectly doesn’t mean it’s redistributing copyrighted books. How many quotes do you know perfectly from books you’ve read before? I would guess quite a few. LLMs are doing the same thing, but on mega steroids with a nearly limitless capacity for information retention.
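
The repetition point can be sketched with a toy example. This is a hedged illustration only, assuming a bigram counter with greedy (most-likely-next-word) generation as a stand-in; real LLMs are neural networks and memorize through a different mechanism. The names `train` and `greedy_continue`, the invented passage, and the made-up “other posts” are all assumptions for illustration. What it shows: the same passage is recited word for word when it is repeated many times in the training text, but not when it appears only once among other text.

```python
from collections import defaultdict, Counter

def train(texts):
    """Count which word follows which across all training texts."""
    counts = defaultdict(Counter)
    for text in texts:
        words = text.split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts

def greedy_continue(counts, start, max_words):
    """Repeatedly append the most frequently seen next word."""
    out = [start]
    while len(out) < max_words and counts.get(out[-1]):
        out.append(counts[out[-1]].most_common(1)[0][0])
    return " ".join(out)

# Invented passage and posts (hypothetical, not real book text).
passage = "wizards seldom forget their first broom"
other_posts = ["wizards often argue online", "their first reply was rude"]

# Passage reposted 50 times versus seen once among other posts.
often_seen = train([passage] * 50 + other_posts)
rarely_seen = train([passage] + other_posts * 3)

recited = greedy_continue(often_seen, "wizards", 10)
not_recited = greedy_continue(rarely_seen, "wizards", 10)
print(recited)
print(not_recited)
```

In the heavily repeated case the counts for the passage swamp everything else, so the most likely continuation at every step is the passage itself, even though no copy of the passage is stored as a single string.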

Hup!@lemmy.world · edit-2 3 years ago

Nope, people are just acting like ChatGPT is making commercial use of the content. Knowing a quote from a book isn’t copyright infringement. Selling that quote is. Also, it doesn’t need to be content stored 1:1 somewhere to be infringement. That misses the point. If you’re making money off a synopsis you wrote based on imperfect memory and in your own words, it’s still copyright infringement until you sign a licensing agreement with JK. Even transforming what you read into a different medium like a painting or poetry can infringe the original author’s copyrights.

Now mull that over and tell us what you think about modern copyright laws.

Ronath@lemmy.world · 3 years ago

Just adding, that, outside of Rowling, who I believe has a different contract than most authors due to the expanded Wizarding World and Pottermore, most authors themselves cannot quote their own novels online because that would be publishing part of the novel digitally and that’s a right they’ve sold to their publisher. The publisher usually ignores this as it creates hype for the work, but authors are careful not to abuse it.

abbotsbury@lemmy.world · 3 years ago

but on mega steroids with a nearly limitless capacity for information retention.

That sounds like redistributing copyrighted books

Teritz@feddit.de · 3 years ago

Using copyrighted work, art for example, still influences the AI, which they make a profit from.

If they use my works then they need to pay, that’s it.

coheedcollapse@lemmy.world · 3 years ago

Still kinda blows my mind how like the most socialist people I know (fellow artists) turned super capitalist the second a tool showed like an inkling of potential to impact their bottom line.

Personally, I’m happy to have my work scraped and permutated by systems that are open to the public. My biggest enemy isn’t the existence of software scraping an open internet, it’s the huge companies who see it as a way to cut us out of the picture.

If we go all copyright crazy on the models for looking at stuff we’ve already posted openly on the internet, the only companies with access to the tools will be those who already control huge amounts of data.

I mean, for real, it’s just mind-blowing seeing the entire artistic community pretty much go full-blown “Metallica with the RIAA” after decades of making the “you wouldn’t download a car” joke.

dx1@lemmy.world · 3 years ago

Nobody would defend copyright if it wasn’t already in place, it’s a sick idea. They ask us to cut the field of human knowledge for private benefit. Now they want to destroy a new technology in its name. Greed knows no bounds.

voluble@lemmy.world · 3 years ago

Nobody would defend copyright if it wasn’t already in place

I don’t know about that. Say you take a few years to write a handful of poems, and it turns out people in your neighborhood really like them. You compile the poems into a book, and sell it for $5, and it sells well. Seeing this, your neighbor buys one, copies it, and starts selling it one neighborhood over for $2, and representing themself as the author. I would think most people in that situation would want to say, ‘hey, that’s not fair’. I don’t think that’s sick or rooted in greed, copyright can be a check on greed.

dx1@lemmy.world · 3 years ago

So thanks to copyright, we’re now living in a world where artists are fairly compensated and not exploited by large corporations acting as middlemen that have seized control of their creative works and used it for their own profit?

BURN@lemmy.world · 3 years ago

More so than we would be without copyright at all

Copyright needs to be extended for individuals and cut back for corporations. People should be allowed to own rights to their ip, but corps should have much higher levels of restrictions and how some knowledge must be shared.

dx1@lemmy.world · 3 years ago

More so than we would be without copyright at all

It’s hard to imagine how it could be worse than what we have now.

Copyright needs to be extended for individuals and cut back for corporations. People should be allowed to own rights to their ip, but corps should have much higher levels of restrictions and how some knowledge must be shared.

Well in effect that would scale back the copyright nightmare we have now, but the basic problem is still there. The argument is still for near-indefinite monopoly privilege over information to be given to its creator at the expense of humanity’s ability to share and reproduce the work, I don’t think that’s justifiable.

assassin_aragorn@lemmy.world · 3 years ago

So the people who generate and curate that knowledge don’t deserve to be compensated? Are you going to be a full time wikipedia editor then? Or does your “greed know no bounds”?

Hildegarde@lemmy.world · 3 years ago

I defend the idea of copyright. The first copyright law was in 1710, to protect authors from the printing press. Without copyright, whoever owned the printing press would sell copies of books with no obligation to pay the author. When copying art is trivial, the artist needs copyright protection in order to make a living creating art.

There are major problems with modern copyrights. Like all things in capitalism it has been subverted to benefit the rich, but the core idea behind copyright is sound.

These lawsuits are not to stop the development of generative AI. These lawsuits are to stop the unlicensed use of copyrighted works as AI training data.

There are AI models that are only trained with licensed data. This doesn’t stop the development of AI.

Artists should have the right to choose whether their work is used as training data. And they should be compensated fairly for it. That will be the case if these lawsuits succeed.

BURN@lemmy.world · 3 years ago

I defend copyright. The original intent was to protect creators in order to foster more creativity. Most artists will have no incentive to create if their work can be reappropriated by a larger group to leverage it for monetary gain, which is directly being taken from the original creator.

I’m a photographer. I’ve removed all my pictures from the internet and plan to never post more. I don’t want my work being used to train AI. Right now we have no choice in that matter, so the only option is to no longer share our work.

dx1@lemmy.world · 3 years ago

I’ve released tons of stuff and it’s under Creative Commons/public domain. I welcome people to share it or create derivative works.

BURN@lemmy.world · 3 years ago

Cool. That’s a fine stance to have and one that plenty of other people will have too. I’m fine with actual people doing it. I’m not fine with AI. The point is the artist should have a choice if they’d like to allow training.

The problem right now is we can’t control that. Everything is being used for AI training whether you want it to be or not. If I could explicitly forbid use of it for AI training (in a way that could be backed in court) I’d be more willing to post them again.

Lemmy users are not an accurate representation of artists imo. This site skews extremely far left, to the point of such anti-corporate nonsense that I believe the majority of people just want to hurt anyone with more money than them as much as possible.

angstylittlecatboy@reddthat.com · edit-2 3 years ago

I feel like a lot of internet people (not even just socialists) go from seeing copyright as at best a compromise that allows the arts to have value under capitalism to treating it like a holy doctrine when the subject of LLMs comes up.

Like, people who will say “piracy is always okay” will also say “ban AI, period” (and misrepresent organizations that want regulations on its use as wanting a full ban.)

Like, growing up with an internet full of technically illegal content (or grey area at best) like fangames and YouTube Poops made me a lifelong copyright skeptic. It’s outright confusing to me when people take copyright as seriously as this.

Sir_Kevin@lemmy.dbzer0.com · 3 years ago

Fuckin preach! I feel like I’m surrounded by children that didn’t live through the many other technologies that have come along and changed things. People lost their shit when Photoshop became mainstream, when music started using samples, etc. AI is here to stay. These same people are probably listening to autotuned music all day while they complain on the internet about AI looking at their art.

Jat620DH27@lemmy.world · 3 years ago

I thought everyone knew that OpenAI has the same access to books and knowledge that human beings have.

Redditiscancer789@lemmy.world · 3 years ago

Yes, but it’s what it is doing with it that is the murky grey area. Anyone can read a book, but you can’t use those books for your own commercial stuff. Rowling and other writers are making the case their works are being used in an inappropriate way commercially. Whether they have a case iunno ianal but I could see the argument at least.

Touching_Grass@lemmy.world · 3 years ago

Harry Potter uses so many tropes and so much inspiration from other works that came before. How is that different? Wizards of the Coast should sue her into the ground.

scarabic@lemmy.world · 3 years ago

One of the first things I ever did with ChatGPT was ask it to write some Harry Potter fan fiction. It wrote a short story about Ron and Harry getting into trouble. I never said the word McGonagall and yet she appeared in the story.

So yeah, case closed. They are full of shit.

PraiseTheSoup@lemm.ee · 3 years ago

There is enough non-copyrighted Harry Potter fan fiction out there that it would not need to be trained on the actual books to know all the characters. While I agree they are full of shit, your anecdote proves nothing.

Cosmic Cleric@lemmy.world · 3 years ago

While I agree they are full of shit, your anecdote proves nothing.

Why? Because you say so?

He brings up a valid point, it seems transformative.

LittleLordLimerick@lemm.ee · 3 years ago

The anecdote proves nothing because the model could potentially have known of the McGonagall character without ever being trained on the books, since that character appears in a lot of fan fiction. So their point is invalid and their anecdote proves nothing.