As AI creative tools become widespread, the question of copyright in AI creations has also taken centre stage. But while copyright nerds obsess over the authorship question, the issue that is getting more attention from artists is that of copyright infringement.
AI is trained on data. In the case of graphic tools such as Imagen, Stable Diffusion, DALL·E, and MidJourney, the training sets consist of terabytes of images comprising photographs, paintings, drawings, logos, and anything else with a graphical representation. The complaint by some artists is that these models (and their accompanying commercialisation) are being built on the backs of human artists, photographers, and designers, who are not seeing any benefit from these business models. The language gets very animated in some forums and chat rooms, often using terms such as “theft” and “exploitation”. So is this copyright infringement? Are OpenAI and Google about to get sued by artists and photographers from around the world?
This question has two parts: the input phase and the output phase.
The explosion in the sophistication of AI tools has come because of two important developments: firstly, the improvement and variety of training models, and most importantly, the availability of large training datasets. The first source of works is open access or public domain material: works that are licensed under permissive licences such as Creative Commons (example here), or works that are in the public domain (example here). The supply of such works is of course limited, but researchers also have access to many other datasets, some of which are even free (lists here and here).
But researchers may also want to try and scrape images from the largest image repository in the world: the Internet. Can they do that? There’s growing recognition that mining data (in this case in the shape of images) is allowed under copyright as fair use or fair dealing. The earliest source of an exception for training an AI can be found in the United States in the shape of the Google Books case. This was a long-running dispute between the Authors Guild and Google over scanning books for a service called Google Print (later renamed Google Book Search). After a lengthy battle involving settlements and appeals, the court decided that Google’s scanning was fair use. The transformative nature of the scanning played a big part in the decision, as did the fact that the copying would not affect the market for book sales online. The purpose of the Google database was to make the works available to libraries, as well as to provide snippets in search results.
While Google Books does not deal specifically with machine learning, it is similar in many ways to what happens in most machine learning training: large amounts of works are copied to produce something different.
In the EU, the Digital Single Market Directive has also opened the door for wider adoption of text and data mining. In Art 3 the Directive sets out a new exception to copyright for “reproductions and extractions made by research organisations and cultural heritage institutions in order to carry out, for the purposes of scientific research, text and data mining of works or other subject matter to which they have lawful access.” Art 4 extends this permission to commercial organisations for any purpose, as long as they have lawful access to the work, and also gives rightsholders the opportunity to opt out of this exception.
The end result of the above is that a large number of commercial entities operating both in the US and Europe are able to scrape images from the Internet for the purpose of data mining, and they can make reproductions and extractions of such materials. Furthermore, other countries such as the UK and Japan have similar exceptions.
Between open data, public domain images, and the data mining exceptions, we can assume that the vast majority of training for machine learning is lawful. While it is possible to imagine some data being gathered and used unlawfully, I cannot imagine that the biggest organisations involved in AI are infringing the law in this respect.
Assuming a lot of the inputs that go into training AI are lawful, then what about the outputs? Could a work that has been generated by an AI trained on existing works infringe copyright?
This is trickier to answer, and it may very well depend on what happens during and after the training, and on how the outputs are generated, so we have to look under the hood at machine learning methods in more detail. A big warning first: obviously I’m no ML expert, and while I have been reading a lot of the basic literature for a few years now, my understanding is that of a hobbyist. If I misrepresent the technology it is my own fault, and I will be delighted to correct any mistakes. I will of course be over-simplifying some stuff.
The main idea behind creative AI is to train a system so that it can generate outputs that statistically resemble its training data. In other words, if you want it to generate poetry, you train the AI with poetry; if you want it to generate faces, you train it with faces. There are various models for generative AI, but the two main ones are generative adversarial networks (GANs) and diffusion models.
A GAN is a model that uses two agents set against each other (hence the “adversarial”) in order to generate better outputs. There is a generator, which generates output based on a training dataset, and there is a discriminator, which compares the generated output against the training data in order to discern whether it resembles it. If it does not, the output is discarded in favour of outputs that resemble the input.
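To make the generator-versus-discriminator idea a little more concrete, here is a minimal sketch in plain Python, offered with the same hobbyist caveat as above. It is a toy, not how any of the image tools mentioned here are actually built: the "training data" is just numbers drawn around 4.0, the generator is a single line `a*z + b`, the discriminator is a single logistic unit, and every name and parameter value is my own assumption for illustration. Note also that in this sketch the generator is nudged by the discriminator's feedback through gradient updates, rather than having individual outputs literally thrown away, which is the over-simplified picture given above.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def discriminate(x, w, c):
    """Discriminator: estimated probability that sample x is real."""
    return sigmoid(w * x + c)


def discriminator_step(real, fake, w, c, lr):
    """One gradient-ascent step on log D(real) + log(1 - D(fake))."""
    grad_w = grad_c = 0.0
    for x in real:
        p = discriminate(x, w, c)
        grad_w += (1.0 - p) * x
        grad_c += 1.0 - p
    for x in fake:
        p = discriminate(x, w, c)
        grad_w -= p * x
        grad_c -= p
    n = len(real) + len(fake)
    return w + lr * grad_w / n, c + lr * grad_c / n


def generator_step(noise, a, b, w, c, lr):
    """One gradient-ascent step on log D(G(z)), where G(z) = a*z + b."""
    grad_a = grad_b = 0.0
    for z in noise:
        p = discriminate(a * z + b, w, c)
        grad_a += (1.0 - p) * w * z
        grad_b += (1.0 - p) * w
    n = len(noise)
    return a + lr * grad_a / n, b + lr * grad_b / n


def train(steps=500, batch=32, lr=0.01, seed=0):
    """Alternate discriminator and generator updates (hypothetical settings)."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator parameters
    w, c = 0.0, 0.0  # discriminator parameters
    for _ in range(steps):
        # Stand-in for the training set: numbers clustered around 4.0.
        real = [rng.gauss(4.0, 0.5) for _ in range(batch)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        fake = [a * z + b for z in noise]
        w, c = discriminator_step(real, fake, w, c, lr)
        a, b = generator_step(noise, a, b, w, c, lr)
    return a, b, w, c


if __name__ == "__main__":
    a, b, w, c = train()
    print("generator parameters:", a, b)
    print("discriminator parameters:", w, c)
```

Even in this toy, the two sides pull against each other: the discriminator learns to score real samples higher than generated ones, and the generator shifts its outputs towards whatever the discriminator currently scores as real. Toy set-ups like this are known to oscillate rather than settle neatly, so the sketch is meant to show the structure of the contest, not a reliable training recipe.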