Date: 2023-07-19

AI is having its hot girl summer. Chatbots like ChatGPT and AI image generators like Midjourney have captured our imagination with their power to create everything from poems to images of Donald Trump being arrested.

Google is among the companies under scrutiny for allegedly bending copyright law to train their AI tools. Lionel BONAVENTURE/AFP/Getty Images

But the data that trains these systems was generated by people, including artists, writers and ordinary internet users. For professional artists and authors, seeing your creations drive systems that generate income for a company and not for you is creating something else: a new legal frontier.

Recently, Google has been sued for allegedly scraping people’s personal data to train its AI tools, while two authors sued OpenAI for copyright infringement, and comedian Sarah Silverman sued both OpenAI and Meta, alleging her copyrighted work was used to train their AI products.

Copyright is tricky to determine. Pamela Samuelson, co-director of the University of California, Berkeley Center for Law & Technology, has studied technology and the legal system since 1983, and recently published an article that tries to make sense of how generative AI butts up against copyright, and why the legal case might not be as clear-cut as Silverman and other creators hope.

To get more perspective on the brewing fight between AI companies and creators, The Messenger spoke with Samuelson about the lawsuits against AI companies and what she thinks the future holds for this legal conundrum.

This conversation has been lightly edited for length and clarity.

The Messenger: How did you come to study generative AI? What was your first experience with it and what struck you?

Samuelson: Even before lawsuits were filed, it was hard not to notice that ChatGPT was one of the fastest-growing technologies, and that was really interesting. And of course, the Copyright Office started having to deal with questions about registering claims of copyright in computer-generated stuff. I was interested to see how they would adapt. And then of course, these lawsuits got brought (I know some of the lawyers on these cases) and I needed to understand the technology, at least to a certain level of detail. One of the things that has been dogma for me since the 1980s is you can't regulate technology if you don't understand what it is about.

I wanted to understand: how does this stuff work? How does training work? How do these systems generate text? How do they generate images? If the courts are going to have to address issues about copyright, they can't just assume that the technology is like, “Hey, I copied this stuff when I ingested it and now there's output and the output is derived from the input and therefore it must be a derivative work.” It doesn't work that way.

My husband is a technologist and he's written about AI stuff and so I got lessons from him. And then I talked to a couple of other technical experts, and started building a mental model about what's going on under the hood, which is just to say: This is more complicated than some of the lawsuits might make it seem.

Break that down for me. What are some of the complications?

One thing that's really important is that the overwhelming majority of the content that's been used to make large language models and to make image models is digital content that's available on the open internet. There's a pretty good argument that when anybody puts their stuff up on the open internet, even if they don't expect it to be copied for training data, it's available. It's very common for web crawlers to essentially make copies of everything on the internet.

People have tried various legal theories to stop web scraping in one context or another, and overall they have not been successful.

The copyright lawsuit is the most recent one, but there were lawsuits in the 1990s about trespass to chattel. This was the idea that by making copies of stuff on the internet you're using somebody's computer without permission and therefore you're trespassing on their property. That didn't work. Then relatively recently in the hiQ versus LinkedIn case, there was an effort to stop web scraping by saying that it was a violation of the Computer Fraud and Abuse Act (CFAA). And the Supreme Court's February decision essentially interpreted the CFAA in a way that caused the Ninth Circuit Court of Appeals to decide that the web scraping claim that LinkedIn was making about hiQ was not, in fact, a violation of the CFAA. And now we have this copyright lawsuit.

The current cases are in very early stages, and we'll see what happens with them, but this is just the latest in a series of cases challenging people for web scraping.

When it comes to the lack of legal victories over web scraping, do you see this as the legal system struggling to keep up with the tech or the plaintiffs getting over their skis?

The claims are novel. Most of the cases are actually class action lawsuits, and class action lawsuits are devices that allow people who are similarly situated to bring claims where each claim on its own probably would not justify bringing a whole lawsuit. But if you can show that all of the people who are in the class have similar interests and are affected by the technology in the same way, then sometimes you can get a class certified.

Now, when it comes to, for example, the Doe 3 versus GitHub case, there are probably a lot of computer programmers out there who are pretty happy about GitHub Copilot because it's a tool that allows people to get a little help writing computer programs. So that may be a difficult class to certify, because if the class members don't have the same interest, then you can't really get the class certified. But that's way down the line.

There are a number of other issues. In the Doe 3 versus GitHub case, the complaint talks about massive piracy and copyright infringement — but there's no copyright infringement claim in the case. So that's kind of curious. It has some novel theories which will be vigorously debated. I know the lawyers who work for Microsoft and GitHub personally, and I've talked to them, and I think they're pretty confident that they have a pretty good defense in that particular case.

The case that I think has the most legs is the Getty Images versus Stability case, because Getty has a licensing program. If you want to use the 12 million images that are part of the Getty website, they have a license system for training data. So it may be that that lawsuit will settle with a licensing agreement. That would probably be a good thing. Now, you can't really settle the Anderson class action, because Anderson and her co-plaintiffs don't have the capacity to issue a license for visual art.

A lot of the stuff that's on the internet is actually photographs taken by people like me. So I would be a member of the class, I suppose, of visual artists, but I have different interests. I think there will be problems with the Silverman case too, because while some professional writers share her point of view, there are lots of authors, lots of writers, lots of blog posters, who have different points of view.

So why is copyright law “porous”?

It’s intentionally porous. Copyright doesn't extend to ideas, copyright doesn't extend to facts, copyright doesn't extend to things that are common in works of this kind, things where there's only a small number of ways to express certain ideas, or the subjects of an expressive work.

Whether it's a picture of a dog, or a song about a beautiful sunset… beautiful sunsets and dogs basically are not things that are within the scope of copyright. What I think the people who are developing the generative AI systems think that they're doing, is they're decomposing what is a copyrighted work into facts about the work.

[In other words,] what words tend to be next to each other? What words tend not to be next to each other? What kind of pixels can come next to one another for certain kinds of images? They can reassemble things so it's a little bit like magic actually.

The way you're describing predictive AI makes me think of it as Lego pieces. You're deconstructing a thing and you're building a different thing that might be similar but not exactly the same. And you’d argue that isn’t plagiarism?

The story tech people tell themselves [is] a powerful story. But of course, there are quite a few people who are professional writers and professional visual artists who feel threatened and who are just angry that you're building this model. The reason that [the models are] so good is that you built them on our work, and you didn't ask our permission and you didn't give us any money for it and that just feels unfair.

Some of them actually just want to get paid. And some of them want to say, no, you can't train on my data. That’s why there’s conflict.

What would be a strong litigation strategy if you're a plaintiff in one of these cases?

I would say that it's interesting that there hasn't been a big push to do generative AI music in the style of, for example, Taylor Swift. One thing that is known about the music industry, if you pay any attention to copyright stuff, is that the music industry really hates new technology and they are plenty happy to bring lawsuits. I listened to part of the Copyright Office listening session in which people from recorded music and the National Music Publishers Association made clear they are prepared to go to war.

So it's one thing to do open source software, which is what GitHub and OpenAI have been doing with [Microsoft] Copilot, and quite another thing to go up against the recording industry of America.

So where is all this legal back-and-forth going?

I think the Copyright Office will ask a series of questions. They will publish what's called a notice of inquiry, which identifies a set of questions that the Copyright Office has gathered from the listening sessions that they held in the spring. They want people to comment on it from a lot of different perspectives, and one can count on certain actors contributing.

The recording industry will definitely put in some submissions and the Authors Guild will put in some submissions and some visual artists groups will do it, photographers will do it. But there will also be some from AI companies trying to explain why this is actually a good thing for copyright law, and why it is both lawful and also why it's consistent with the purposes of copyright — which is to promote the creation and dissemination of new works.

The office will try to sort through all of the different perspectives, and then come up with its own analysis of the issues, and then write a report, probably to some members of Congress, and then Congress will probably hold some hearings. Usually Congress just lets the courts decide these cases.

But there will be a lot of discussion about the issues, and I think the Copyright Office is not very likely to issue a green light to all the generative AI companies. But of course, whatever they say, they’re not the final word, because the courts actually have to make a decision about what the law is right now. And if they make recommendations that there should be a new law, that won't help the plaintiffs in these cases.