Adobe faces copyright lawsuit over AI training data. Did they use pirated books?
So, here's the deal: Adobe, like everyone else, is diving headfirst into the AI game. They've launched a bunch of AI services, including Firefly, their AI-powered media suite. But this enthusiasm for AI might have landed them in hot water. A new lawsuit alleges they used pirated books to train one of their AI models. Can you imagine?
This proposed class-action lawsuit, filed by author Elizabeth Lyon, claims Adobe used pirated versions of books, including hers, to train their SlimLM program. SlimLM, according to Adobe, is a language model optimized for mobile document assistance. It was supposedly trained on SlimPajama-627B, an open-source dataset. Lyon claims her works were included in the dataset Adobe used.
The lawsuit states that SlimPajama was created by copying and manipulating the RedPajama dataset, including Books3. For those of you not in the know, Books3 is a massive collection of 191,000 books that has been used to train GenAI systems. It's been a source of legal headaches for the tech world. RedPajama has also been cited in other lawsuits. It seems a lot of companies are facing the same accusation.
These lawsuits are becoming increasingly common. AI algorithms need massive datasets to learn, and sometimes, those datasets allegedly include pirated materials. In September, Anthropic agreed to cough up $1.5 billion to authors who accused them of using pirated work to train their chatbot, Claude. This case was a potential turning point in the legal battles over copyrighted material in AI training data.
I think what's happening here is a serious wake-up call for the entire tech industry. You can't just grab whatever data you find lying around and use it to train your AI models. There are copyright laws, people! It is kinda funny to see a company so dedicated to copyright accused of stealing content.
Source: TechCrunch