LAION Releases Updated Dataset, Re-LAION-5B, Free of Child Sexual Abuse Material
LAION, a prominent German research organization known for curating datasets used to train generative AI models such as Stable Diffusion, has launched a new dataset designed to enhance safety and compliance with ethical standards. The dataset, named Re-LAION-5B, has undergone meticulous cleaning to remove all known links to suspected child sexual abuse material (CSAM).
This latest release is an iteration of the previous LAION-5B dataset, featuring improvements based on recommendations from several respected organizations, including the Internet Watch Foundation, Human Rights Watch, the Canadian Centre for Child Protection, and the now-defunct Stanford Internet Observatory. Re-LAION-5B is available in two versions: Re-LAION-5B Research and Re-LAION-5B Research-Safe. The latter additionally filters out adult content; in total, the cleanup addressed thousands of links associated with known and suspected CSAM.
In a blog post, LAION emphasized that it has worked to keep illegal content out of its datasets since its inception, stating, “LAION strictly adheres to the principle that illegal content is removed ASAP after it becomes known.”
It is crucial to clarify that LAION’s datasets do not include actual images. Instead, they consist of indexed links and alternative text for images curated from a broader dataset known as Common Crawl, which compiles information from a variety of online sources.
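This link-plus-caption structure can be pictured as a small metadata record rather than an image file. The sketch below is illustrative only; the field names are hypothetical and do not reflect LAION's actual schema:

```python
# Illustrative sketch of a LAION-style sample: the dataset stores
# a pointer to an image on the open web plus its alternative text,
# never the image bytes themselves. Field names are hypothetical.
sample = {
    "url": "https://example.com/images/cat.jpg",          # link to the image
    "alt_text": "a tabby cat sleeping on a windowsill",   # caption from the page
    "width": 640,
    "height": 480,
}

def is_text_image_pair(record: dict) -> bool:
    """A record is usable only if it carries both a link and a caption."""
    return bool(record.get("url")) and bool(record.get("alt_text"))

print(is_text_image_pair(sample))  # → True
```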
The initiative to release Re-LAION-5B follows troubling findings from a December 2023 investigation by the Stanford Internet Observatory, which revealed that a subset of LAION-5B, specifically LAION-5B 400M, contained at least 1,679 links to illegal images harvested from social media and adult websites. The report also noted the dataset included a range of inappropriate material, such as pornographic images and harmful stereotypes.
In light of these findings, LAION decided to temporarily withdraw LAION-5B from circulation. The Stanford report recommended that all models trained on LAION-5B be deprecated and their distribution halted where possible. This recommendation notably coincided with AI startup Runway’s recent removal of its Stable Diffusion 1.5 model from the AI hosting platform Hugging Face.
The new Re-LAION-5B dataset comprises approximately 5.5 billion text-image pairs and is released under an Apache 2.0 license. LAION encourages third-party organizations to utilize the accompanying metadata to cleanse existing datasets of any illegal content.
While LAION asserts that its datasets are intended strictly for research purposes and should not be used commercially, history suggests that some organizations might disregard this guideline. Notably, Google’s image-generation models have previously employed LAION datasets in their training processes.
Quantifying the cleanup, LAION stated, “In all, 2,236 links [to suspected CSAM] were removed after matching with the lists of link and image hashes provided by our partners.” The organization urges all research institutions still relying on the old LAION-5B dataset to transition to the newly released Re-LAION-5B as soon as possible.
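The matching LAION describes can be sketched as a set-membership check: hash each record's link and drop any record whose hash appears on a partner-supplied blocklist. The helper and hash choice below are assumptions for illustration; real partner lists have their own formats and also use image hashes, not just plain SHA-256 of URLs:

```python
import hashlib

def sha256_hex(text: str) -> str:
    """SHA-256 of a URL string; actual partner lists may use other hash schemes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def filter_dataset(records: list[dict], blocked_link_hashes: set[str]) -> list[dict]:
    """Keep only records whose link hash is NOT on the blocklist."""
    return [r for r in records if sha256_hex(r["url"]) not in blocked_link_hashes]

# Hypothetical example: one known-bad link to remove.
records = [
    {"url": "https://example.com/ok.jpg", "alt_text": "a landscape"},
    {"url": "https://example.com/bad.jpg", "alt_text": "flagged"},
]
blocklist = {sha256_hex("https://example.com/bad.jpg")}

cleaned = filter_dataset(records, blocklist)
print(len(cleaned))  # → 1
```

Working from hashes rather than raw links lets partners share blocklists without redistributing the offending URLs themselves.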