Works by thousands of authors, including Margaret Atwood, Haruki Murakami and Jonathan Franzen, fed into models run by firms including Meta and Bloomberg
Zadie Smith, Stephen King, Rachel Cusk and Elena Ferrante are among thousands of authors whose pirated works have been used to train artificial intelligence tools, a story in The Atlantic has revealed.
More than 170,000 titles were fed into models run by companies including Meta and Bloomberg, according to an analysis of “Books3” – the dataset harnessed by the firms to build their AI tools.
Books3 was used to train Meta’s LLaMA, one of a number of large language models – the best-known of which is OpenAI’s ChatGPT – that can generate content based on patterns identified in sample texts. The dataset was also used to train Bloomberg’s BloombergGPT and EleutherAI’s GPT-J, and it is “likely” to have been used in other AI models.
The titles contained in Books3 are roughly one-third fiction and two-thirds nonfiction, and the majority were published within the last two decades. Along with Smith, King, Cusk and Ferrante’s writing, copyrighted works in the dataset include 33 books by Margaret Atwood, at least nine by Haruki Murakami, nine by bell hooks, seven by Jonathan Franzen, five by Jennifer Egan and five by David Grann.
Books by George Saunders, Junot Díaz, Michael Pollan, Rebecca Solnit and Jon Krakauer also feature, as well as 102 pulp novels by Scientology founder L Ron Hubbard and 90 books by pastor John MacArthur.
The titles span large and small publishers including more than 30,000 published by Penguin Random House, 14,000 by HarperCollins, 7,000 by Macmillan, 1,800 by Oxford University Press and 600 by Verso.
This comes after a lawsuit filed last month by three writers – Sarah Silverman, Richard Kadrey, and Christopher Golden – alleged that their copyrighted works “were copied and ingested as part of training” Meta’s LLaMA. The analysis revealed that the three plaintiffs’ writings are indeed part of Books3.
OpenAI, the company behind AI chatbot ChatGPT, has also been accused of training its model on copyrighted works. Clues to the sources of OpenAI’s training data lie in a paper released by the company in 2020 that mentions two “internet-based books corpora”, one of which is called Books2 and is estimated to contain nearly 300,000 titles. A June lawsuit states that the only websites to offer that much material are “shadow libraries” such as Library Genesis (LibGen) and Z-Library, through which books can be secured in bulk via torrent systems.
Shawn Presser, the independent AI developer who originally created Books3, said that while he is sympathetic to authors’ concerns, he made the database so that anyone could develop generative AI tools and worries about the risks of large companies having control of the technology.
While a Meta spokesperson declined to comment to The Atlantic on the firm’s use of Books3, a Bloomberg spokesperson confirmed that the company did use the dataset. “We will not include the Books3 dataset among the data sources used to train future versions of BloombergGPT,” they added.