Over 170,000 Pirated Works Used to Train Meta’s LLaMA and Bloomberg AI Models

As AI subsidies continue to flow to big data giants, The Atlantic revealed that Zadie Smith, Stephen King, Rachel Cusk and Elena Ferrante are among thousands of authors whose pirated works have been used to train artificial intelligence tools. According to an analysis of “Books3” – the dataset harnessed by the firms to build their AI tools – more than 170,000 titles published in the past 20 years were fed into AI models run by companies including Meta and Bloomberg.

One of the most troubling issues around generative AI is simple: It’s being made in secret.
Alex Reisner, The Atlantic

Books3 was used to train Meta’s LLaMA, one of a number of large language models – the best-known of which is OpenAI’s ChatGPT – that can generate content based on patterns identified in sample texts. The dataset was also used to train Bloomberg’s BloombergGPT and EleutherAI’s GPT-J, and it has “likely” been used in other AI models.

The titles contained in Books3 are roughly one-third fiction and two-thirds nonfiction and the majority are recent. Along with Smith, King, Cusk and Ferrante’s writing, copyrighted works in the dataset include 33 books by Margaret Atwood, at least nine by Haruki Murakami, nine by bell hooks, seven by Jonathan Franzen, five by Jennifer Egan and five by David Grann. Books by George Saunders, Junot Díaz, Michael Pollan, Rebecca Solnit and Jon Krakauer also feature, as well as 102 pulp novels by Scientology founder L Ron Hubbard and 90 books by pastor John MacArthur. The titles span large and small publishers including more than 30,000 published by Penguin Random House, 14,000 by HarperCollins, 7,000 by Macmillan, 1,800 by Oxford University Press and 600 by Verso.

Links to the original articles (I am reading these right now).

https://www.theatlantic.com/technology/archive/2023/08/books3-ai-meta-llama-pirated-books/675063/

https://www.theguardian.com/books/2023/aug/22/zadie-smith-stephen-king-and-rachel-cusks-pirated-works-used-to-train-ai