The Alleged Theft: OpenAI and the Use of Copyrighted Material
Best-selling author Douglas Preston discovered that OpenAI's chatbot, ChatGPT, had a remarkable knowledge of his books, providing detailed plot summaries and descriptions of minor characters. This led Preston and other prominent authors to sue OpenAI for copyright infringement, alleging that their novels were used without permission to train the AI. This case raises important questions about the ethical and legal implications of using copyrighted material to train large language models like ChatGPT. Furthermore, it highlights the growing trend of tech companies pushing boundaries and potentially breaking rules without seeking prior consent. In this episode, Planet Money explores the controversy surrounding OpenAI's actions and the potential consequences for AI development and copyright law.
**Key takeaways**
1. Author Douglas Preston found that OpenAI's chatbot, ChatGPT, had detailed knowledge of his books, suggesting they were used as training data without his permission.
2. Prominent authors, including Douglas Preston and George R.R. Martin, sued OpenAI for copyright infringement.
3. The case raises ethical and legal questions about using copyrighted material to train AI models.
4. It reflects a broader trend of tech companies pushing boundaries without seeking prior consent.
5. The outcome of the lawsuit could have significant implications for AI development and copyright law.
**Episode transcript**
This message comes from NPR sponsor, Noom, using science and personalization so you can manage your weight. Noom helps you understand the science behind your eating choices and why you have those cravings. Sign up for your trial today at Noom.com. This message comes from NPR sponsor, Nest Ami Renewable Diesel, a drop-in replacement for fossil fuel that has the power to keep your fleet running at top performance while lowering greenhouse gas emissions by up to 75%. Visit nestami.com to learn more.

Before we start, this episode discusses Google and Spotify, which are both corporate sponsors of NPR. We also discuss OpenAI. One of OpenAI's major investors is Microsoft, which is also a corporate sponsor of NPR. Here's the show.

This is Planet Money from NPR. Douglas Preston got his big break as a writer when he and his co-author published their first novel, Relic, in 1995. Relic is about a brain-eating monster loose in a museum hunting down and killing people and eating part of their brains. So it's, you know, you will not see my name on the list of Nobel laureates. That's for sure. No Nobel, maybe, but the book was a bestseller, the first of many. And how many books have you written altogether? I'm not sure. I think about 40. Douglas also somehow finds time to write all these articles and books about paleontology and archeology. He's got a lot of interests. He's a curious guy. And one day, that curiosity led him to start playing around with the tech world's shiny new thing, artificial intelligence, specifically OpenAI's chatbot, ChatGPT. Douglas got himself an account and started seeing what this fancy new AI chatbot could do. While we talked, he scrolled back through his history and read me some of his earliest queries. I had it write a paragraph about the execution of Socrates. Please discuss Chopin's Piano Concerto No. 1. Discuss the transcendental number e. Okay, so it appeared to know some math and some history and some music. And it didn't take long for Douglas to wonder, does it know me? Specifically, did ChatGPT know anything about the books he had written? So he starts testing it. Are you familiar with a character called Whittlesey in the novel Relic? Yes, Dr. Whittlesey is one of the characters in the prologue of the book. He's part of the expedition team that travels to the Amazon rainforest and makes a significant discovery which sets the stage for the events that unfold in the story. Is that answer correct? Yes. And Douglas was like, how does it know all that stuff? The Wikipedia entry on Relic doesn't have this kind of detail. And Relic was reviewed, but the reviews were never fine-grained like that. The only way it would know that is if it had ingested the book. Douglas kept going. He asked about other books he had written. ChatGPT knew that his character, Agent Aloysius Pendergast, had platinum hair and that Corrie Swanson was a headstrong forensics expert. It was regurgitating everything. It knew my characters. It knew their names. It knew the settings. It knew everything. So yeah, it certainly seemed like ChatGPT had access to his full books. Maybe legitimate digital copies, maybe pirated PDFs floating around the internet. Who knows? But either way, Douglas owned the copyright to all of his books. And no one from OpenAI had asked him whether they could use them. Which raised the question, can they do that?

Hello and welcome to Planet Money. I'm Keith Romer. And I'm Erika Beras. What happened to Douglas Preston feels a little like a thing that keeps happening to all of us. 
One giant tech company or another swoops in and just does a bunch of stuff without our permission, like keeping track of the websites we visit. Google, I see you. Or showing up to a city and setting up a new unregulated kind of taxi service, even though the city says you can't do that. Hi, Uber. It's like the famous Mark Zuckerberg line, move fast and break things. Tech companies have been doing a lot of that. And the latest example is OpenAI and all those other new AI companies, hoovering up every last piece of human creativity to build their incredibly powerful computer programs. Today on the show, we try to get our heads around what OpenAI is up to. Is it good? Is it bad? Is it legal? And we'll look back at two formative legal cases that are super fascinating on their own, but also offer us a glimpse of how things with OpenAI might turn out.

This message comes from NPR sponsor, American Express Business. The enhanced American Express Business Gold Card is designed to take your business further. It's packed with features and benefits like flexible spending capacity that adapts to your business, 24-7 support from a business card specialist trained to help with your business needs and so much more. The Amex Business Gold Card, now smarter and more flexible. That's the powerful backing of American Express. Terms apply. Learn more at americanexpress.com slash business gold card. Support for NPR and the following message come from Edward Jones. When you wanna navigate through the complexities of retirement strategies, it can help to sync up with an Edward Jones financial advisor who takes an approach that puts your goals first. Let's figure it out together, Edward Jones.

Okay, so we should maybe start by talking a little bit about how ChatGPT works. It's an interface built on a kind of artificial intelligence called a large language model. And what that AI does is essentially predict what the next word in a sentence will be, like auto-complete, but on the grandest scale you can imagine. And to train the AI to do that, computer programmers have to feed it just massive, massive amounts of coherent writing. The technology is only possible because of all that text that it gobbles up. What the author Douglas Preston suspected was that a lot of that text came from copyrighted material, his books and other people's books. I'll never forget a conversation I had with my friend George R.R. Martin, and he was really upset. He said, somebody used ChatGPT to write the final book in my Game of Thrones series. It's my characters, my settings, even my voice as an author, they somehow were able to duplicate using that program. Douglas and George R.R., they got together with 15 other authors and decided to sue OpenAI. Their lawsuit is a class action. They're suing on behalf of themselves and any other professional fiction writers whose work may have been eaten up to create ChatGPT. What evidence do we have that OpenAI was using copyrighted books in its training sets? Right, that is a really good question. That is Mary Rasenberger. She is a copyright lawyer and the CEO of the Authors Guild. Douglas and George R.R. and the other authors wound up partnering with the Authors Guild for their lawsuit against OpenAI. They alleged copyright infringement on an industrial scale. So we do not know because OpenAI, even though they say they're open, they're quite the contrary. They are about as closed as can be in terms of what their training data sets are. 
Is Jonathan Franzen's book The Corrections in the training data? What about My Sister's Keeper by Jodi Picoult or Lincoln in the Bardo by George Saunders? Those authors, by the way, are all plaintiffs in this lawsuit. To start building their case, the authors and their lawyers went looking for concrete evidence. And if the humans at OpenAI wouldn't disclose their training data, maybe there was a way to trick OpenAI's computer program into giving up its sources. Some of the lawyers working with the Authors Guild got to work trying to coax ChatGPT into revealing what it knows. They asked it questions to see how much specific information it can offer up about any particular book. And of course, when you could get it to give you back exact text, clearly it had memorized the book. It seems like a strong sign that it can give you the actual chapter of the book. Yes, yes, yes. Because of the court case, Mary was a little cagey about giving exact details here, but other researchers have managed to get ChatGPT to spit up an entire Dr. Seuss book, full chapters of Harry Potter. Still, to really make the case that OpenAI had, in fact, used all these thousands and thousands of books to train its AI, what the authors really needed was access to the company's records, which Mary says was another reason to sue. In a lawsuit, you get discovery, and presumably we'll find out what the training data set is and what was ingested.

So, okay, this lawsuit was just filed in September. And so this is kind of where the authors' story pauses for now, because it could take literally years for this to work out. But like we said before, there are precedents for what happens when a giant tech company snatches up heaps of copyrighted stuff. Two cases really stand out here. Okay, case number one, the time Google decided to scan all of the books and put them on the internet. And case number two, the time Spotify decided to go ahead and put all of the songs on the internet.

Okay, let's start with the first case, the one about Google and the books. In some ways, this is kind of the law's first big brush with the problem of how much copyrighted material a tech company can scoop up. It is a case that Mary from the Authors Guild remembers well. So the Google Books case was filed in 2005. Google wanted to create what some people refer to as a digital library of Alexandria. Yeah, they made all of these deals with big university libraries around the country that let them come in and add all these books to their giant searchable Google databases. They had ingested, copied millions of books. They literally were just taking truckloads of books out of libraries and scanning them. Google had permission from the libraries to scan the books, but they did not ask permission from the authors. And around 80% of those books were still protected by copyright. So authors and publishers sued. Now, everyone agreed Google had copied lots of copyrighted material without permission from the authors. But copyright, it's not absolute. There are some exceptions. Yeah, copyright law is trying to balance these two interests. On the one hand, a desire for authors to be allowed to make money from what they've created. But on the other hand, a desire for the rest of society to sometimes be allowed to borrow and remix and play around with the work of those authors. The fancy legal name for the kind of copying that the law says is okay is fair use. 
So the traditional fair uses are things like quoting, quoting from another book in your book or from a speech; commentary, so when you do a critique of a play or a book, you're going to include perhaps some of the text from it. Copying a song to write a parody, Weird Al Yankovic style, is also usually fine. Same with photocopying a couple pages of a novel to teach in an English class. But what about what Google was doing? Scanning millions and millions of books to create a searchable database. No one had ever seen anything like that before. Now, there is no hard and fast rule for what counts as fair use and what doesn't. There are these four different factors that a judge is supposed to look at to decide whether a certain act of copying is permissible. Yeah, is someone going to make money off of it? Or are they just doing it for the sake of doing it? Will it hurt the market for the original work? Is it a big important chunk that is copied or a small one? And is the thing that was copied transformed somehow into something new? I will say that the test can be somewhat subjective and, you know, the great minds can come out differently sometimes on fair use.

The great mind in the Google case was a judge named Pierre Leval. He weighed all those fair use factors and decided that all that copying Google had done was fair use. It all came down to the end product Google had created, a giant database of books that people could search directly and that would give them back relevant chunks of these books. It was a way for people all around the world to access books that otherwise might have just gathered dust in the basement of a big library somewhere. Judge Leval thought that was valuable enough to society that it made all that copying legally okay. And it's worth pointing out this kind of weird thing about copyright here. Fair use is not this cut-and-dried thing. So when a company like Google wants to play around at the edges of copyright, it has to just dive in without knowing for sure whether or not the thing they're doing will turn out to be legal. You can't always predict the outcome. Let me say it that way. It is wild that these companies are in some ways incentivized to take a risk of some amount and see if it works out because the courts will decide one way or another eventually. Well, that's what the tech companies like to do. You know, they like to ask permission later. Just do, don't tell anyone what you're doing and then just see what happens.

And just to bring this back to where we started this episode, that is certainly what it appears OpenAI has done with ChatGPT. By the way, we reached out to OpenAI. They declined to comment, but in court filings, they've made it pretty clear that they think what they did to train their AI, that was fair use. Right, getting a Google Books type ruling would be a great outcome for OpenAI. Mary, who is part of the suit against OpenAI, she does not see it that way. This case is very different than that case because here the harm is so visible. It's so clear that the marketplace for creators' works will be harmed by generative AI, which, remember, that is one of the four factors a judge is supposed to look at in a fair use case. How much will the owners of the copyright be financially hurt by the copying of their books? It's the commercial use of the works to develop these machines that will spit out very quickly massive quantities of text that will compete with what they were trained on. That's the issue here. All the Dan Brown novels I could ever want. Yeah. 
So, okay, if some judge decides that it is fair use for OpenAI to train ChatGPT on copyrighted material, then, like Google Books, that's it. Sorry, authors. But what about the other end of the spectrum? What if a judge says all that copying was against the law? Thousands of authors with dozens and dozens of books, and each one is a copyright violation? After the break, we do the math on how much that could cost OpenAI and look at the most likely scenario for how all this plays out.

Support for this podcast and the following message come from Amazon Business. Amazon Business knows your business needs a lot. From markers to monitors, desk chairs to standing desks, you could spend all day tracking down products from an endless list of suppliers. With Amazon Business, you can streamline purchasing to make the most of business spend and your time. Access millions of supplies in a single store with quantity discounts on bulk orders, turning your to-do list into your all-done list. Everything you need in one place, that's smart business buying. Learn more at amazonbusiness.com. This message comes from NPR sponsor Bombas. Bombas has donated over 100 million socks, underwear, and T-shirts to those facing housing insecurity. One item purchased equals one item donated. Visit bombas.com slash NPR, code NPR. Support for NPR and the following message come from Bombas. Make holiday gift giving easier this year with absurdly soft socks, underwear, and T-shirts from Bombas. It's big comfort for everyone on your list. Get 20% off your first purchase at bombas.com slash NPR and use code NPR. If you're looking for a new way to support this show and public media, please consider signing up for the NPR Plus podcast bundle. NPR Plus listeners get to unlock sponsor-free listening and bonus episodes from NPR shows like this one. You can find out more at plus.npr.org.

So in the last few years, tech companies have been basically vacuuming up all of human knowledge and culture to train their AIs. And lately, some of the creators of all of that human knowledge and culture have started pushing back. Yeah, in addition to the authors' lawsuit against OpenAI, there are at least eight other lawsuits brought by songwriters and visual artists and other authors against a bunch of AI companies, all alleging copyright infringement. And like we talked about before, it's possible that legally all of this is fine, that some court may decide this is fair use, but it's also possible that they won't. So in that world, the judge tells OpenAI, your AI is illegal, shut it down. Well, the thing is, it's not like OpenAI can simply remove the selected works of Douglas Preston and George R.R. Martin from their AI's brain. The company would have to basically start from scratch and completely retrain their AI. And then there's the money. So let's run a little back-of-the-envelope math here. The statutory damages for a single act of copyright infringement can reach as high as $150,000. Figure 10,000 authors, 10 books per author, you know what that multiplies out to? $15 billion. However, it is very unlikely that that will happen, which we will show you through case number two. Yeah, the time Spotify decided to stream all the songs. This one shows how sometimes a gigantic lawsuit can actually be a good thing for the tech company getting sued. To help explain this one, we reached out to UCLA law professor Zian Tang. I guess I would say that I wanted to be a copyright lawyer from the time I was 16, which sounds really weird. 
Let's say unusual, we don't have to say weird. Yes, it's very unusual. Before she was a professor, Zian worked for a few big law firms. I worked on a Red Bull class action where the claim was like, you know, Red Bull gives you wings, but it actually doesn't give you wings. There's no more caffeine in it than a cup of coffee. Or like, you know, I bought this anti-aging product because I thought it would turn back time and I, you know, I'm 40, but I thought I would look 18 and I don't. And now I'm suing for it on behalf of myself in a class. And Zian was one of the lawyers on Spotify's defense team during the big case we're gonna talk about. Right, Spotify had been streaming millions and millions of songs, but they hadn't gotten licenses for all of those songs. There were two main plaintiffs in the lawsuits that then eventually got consolidated into one lawsuit: one was filed by a songwriter named Melissa Ferrick, and another was filed by a songwriter named David Lowery. He was in a band, a couple of bands that, you know, I think a lot of people are familiar with. One was Camper Van Beethoven, one was Cracker. Erika, are you more of a "Low" fan or more of a "What the World Needs Now" fan? I actually don't know either of these. No, no Cracker songs? All right, I'll stay over here on Gen X Island by myself. I'm Gen Y. I'm the secret generation that lasted one year after Gen X.

Okay, in any event, the lawsuit basically came down to this. 90% of the songs that Spotify wanted to stream in the US were managed by a handful of big companies. And Spotify had signed licensing deals with those companies. But that left this last 10% of songs that Spotify also wanted to stream. Spotify hired an outside company to get deals with the copyright holders for those songs, but someone somewhere along the line dropped the ball. And even though they didn't end up getting licenses for all those songs, Spotify went ahead and streamed them anyway. And so Spotify tried and wanted to do everything right by the book. But the reality is that it's the music publishers themselves that have really bad data that makes it like near impossible for someone to figure out who to pay. But that feels like an argument that I would be sympathetic to hearing from my nine-year-old daughter, in terms of like, I tried to do the right thing, but I couldn't. But legally, would that hold any water in terms of, it's not our fault. We couldn't do it. We tried. So, you know, I think there's a couple parts to your question. One is, legally, would it hold water? No. I mean, legally, it wouldn't hold water. Did they have a point? I think they did.

And this is where the Spotify case gets really interesting because Zian says getting sued by those two songwriters was kind of fantastic news for Spotify. I'm definitely not speaking for Spotify here when I say it's almost a blessing, but it does almost feel like a relief to be able to say, oh, now we have this class that's established with all these people in it. Let's pay some amount of money that's not gonna bankrupt the business and allow us to say, hey, we're actually paying all these people now, whereas the allegation was that we weren't before, and we can keep operating. So, I mean, it sounds a little like Spotify's essential problem was not having an opposite side to negotiate with, and the class action essentially gave them somebody to negotiate with. Yes, it's like, you know, yeah, exactly. We didn't know who to even go out to and talk to about this. 
And now these people are popping up out of the woodwork and saying, hey, it's me. And, you know, God, I'm thinking about Taylor Swift. I'm the problem, it's me. I immediately had to say, my daughter listens to Taylor Swift 24 hours a day, and you said those words, and I was like, yep, that song's in my head now. Right, yeah, I'm the problem. Actually, you know, I'm the legal problem. Negotiate with me. I mean, put yourself in Spotify's shoes. There's this 10% of songs that they wanted to license, but tracking down every indie artist and every indie indie artist and unspooling the knot of publishing rights, it wasn't gonna happen. And then one day, these two musicians show up and say, we represent that entire 10%. Like, that is kind of great for Spotify.

In the end, the class action didn't go to trial. The company and the folks who had songs in that tricky 10% ended up reaching a deal. Spotify agreed to pay them for all its past copyright infringements and set up a system to pay for streaming royalties going forward. And you know, if we were looking for examples of how the class action by the authors against OpenAI might play out, there's a really good chance this is it. No giant dramatic trial, just two sides working out a deal. Zian has looked pretty deeply into the history of these kinds of cases. I did a study where I looked at every single class action that was filed between basically the advent of the class action mechanism, you know, a century ago to recent date, up to the point where the article came out, which was, I think, last year. In over 100 copyright class actions, only one ever went all the way to a full trial. And yet, they keep being filed. And they keep being filed. And, you know, that's why I say it's almost like it's an invitation to settlement, I think. So essentially, we have this whole legal theater, which is just the beginning of a negotiation. Yes, correct. So if you think about the authors' lawsuit from OpenAI's perspective, maybe the lawsuit isn't the worst thing. The company has used all of this copyrighted material, allegedly, hundreds of thousands of books. There is no good way to unfeed all of those books to their AI. But also, it would be a huge pain to track down every single author and work out a licensing deal for those books. So maybe this lawsuit will let them do it all in one fell swoop by negotiating with this handy group of thousands of authors who have collectively sued them.

This episode was produced by Willa Rubin and Sam Yellowhorse-Kessler. It was edited by Kenny Malone and fact-checked by Sierra Juarez, engineering by Robert Rodriguez. Alex Goldmark is our executive producer. Coming up next week on Planet Money, China's economy is on the brink of a crisis, and we're going to figure out how they got there. Quick hint, it's real estate. You know, I was in that game. So if you're not taking a maximum risk to expand your business empire, next year you look at your peers and say, like, damn, you know, I only built 10,000 apartments. They're already selling 15. I'm behind. That's next week on Planet Money from NPR. And thanks today to Danielle Gervais, Dawa Kila, and Douglas Preston's co-author, Lincoln Child. I'm Keith Romer. And I'm Erika Beras. This is NPR. Thanks for listening.

This message comes from NPR sponsor MassMutual. Don't let the day-to-day of running your business keep you from planning for tomorrow. Start preparing for the future. Talk to a MassMutual financial professional today. 
You'll feel comfortable about tomorrow. This message comes from NPR sponsor Discover. Did you know Discover wants everyone to feel special? That's why with your Discover card, you have access to 24-7 customer service as well as zero dollar fraud liability, which means you're never held responsible for unauthorized purchases. Learn more at discover.com slash credit card. Limitations apply. With NPR Plus, there's more to hear, like extended interviews with some of the experts we talked to at Planet Money and The Indicator. It's a mistake for economists to only think about economic efficiency when considering policies because you'll actually wind up with a worse outcome. And with NPR Plus, you help keep NPR going. Learn more at plus.npr.org.