First science publisher sues over scraped research papers

Elsevier is one of several publishers alleging that their copyrighted works were used to train AI models. Credit: Kristoffer Tripplaar/Alamy

A scientific publisher has joined the dozens of firms and individuals suing artificial intelligence companies over their alleged use of copyrighted works in training AI models.

Elsevier – which publishes thousands of journals, including Cell and The Lancet – is a plaintiff in a class-action lawsuit filed on 5 May against technology company Meta and its chief executive Mark Zuckerberg in the Southern District of New York. Also named as plaintiffs in the lawsuit are book-publishing giants Hachette and Macmillan, and the US fiction author and lawyer Scott Turow. The publishers allege that Meta obtained and reproduced copyrighted works in developing its large language model (LLM) Llama.

“This case is the first AI action brought by major publishing houses, who have their own story to tell about Meta’s flagrant violation of their rights,” said the Association of American Publishers, in a statement.

The case mirrors those of authors and media companies – including The New York Times – suing AI firms on similar grounds. Some cases have been settled but, overall, they have yet to establish a clear precedent on whether it is legal to use copyrighted works to train an LLM. A Meta spokesperson has said the company would “fight this lawsuit aggressively”.

Although AI firms are cagey about their training data, it is widely assumed that paywalled research papers, as well as open-access ones, formed part of the billions of web pages that models were trained on.

Training data

The lawsuit alleges that, to train Llama, Meta used the Common Crawl data set, a sample of billions of web pages made by trawling the Internet. The plaintiffs say that this data set is likely to have included unauthorized copies of copyrighted works, such as scientific abstracts and paywalled papers.

The publishers also allege that Meta downloaded and torrented (sourced using a file-sharing method) works from sites including LibGen, a database of books, research papers and textbooks; and Sci-Hub, a repository that gives free access to millions of research articles and books regardless of copyright. Both sites have been the subject of legal challenges. Much of the evidence comes from e-mails between Meta employees that were revealed during a separate case in which several book authors sued Meta last year (Kadrey v. Meta).

Meta has suggested that it will argue that training on copyrighted documents constitutes ‘fair use’, a copyright exemption in US law. “AI is powering transformative innovations, productivity and creativity for individuals and companies, and courts have rightly found that training AI on copyrighted material can qualify as fair use,” its spokesperson said.
