EleutherAI releases massive AI training dataset of licensed and open domain text

June 8, 2025


EleutherAI, an AI research organization, has released what it claims is one of the largest collections of licensed and open-domain text for training AI models.

The dataset, called the Common Pile v0.1, took around two years to complete in collaboration with AI startups Poolside, Hugging Face, and others, along with several academic institutions. Weighing in at 8 terabytes, the Common Pile v0.1 was used to train two new AI models from EleutherAI, Comma v0.1-1T and Comma v0.1-2T, that EleutherAI claims perform on par with models developed using unlicensed, copyrighted data.

AI companies, including OpenAI, are embroiled in lawsuits over their AI training practices, which rely on scraping the web (including copyrighted material like books and research journals) to build model training datasets. While some AI companies have licensing arrangements in place with certain content providers, most maintain that the U.S. legal doctrine of fair use shields them from liability in cases where they trained on copyrighted work without permission.

EleutherAI argues that these lawsuits have “drastically decreased” transparency from AI companies, which the organization says has harmed the broader AI research field by making it harder to understand how models work and what their flaws might be.

“[Copyright] lawsuits have not meaningfully changed data sourcing practices in [model] training, but they have drastically decreased the transparency companies engage in,” Stella Biderman, EleutherAI’s executive director, wrote in a blog post on Hugging Face early Friday. “Researchers at some companies we have spoken to have also specifically cited lawsuits as the reason why they’ve been unable to release the research they’re doing in highly data-centric areas.”

The Common Pile v0.1, which can be downloaded from Hugging Face’s AI dev platform and GitHub, was created in consultation with legal experts, and it draws on sources including 300,000 public domain books digitized by the Library of Congress and the Internet Archive. EleutherAI also used Whisper, OpenAI’s open source speech-to-text model, to transcribe audio content.

EleutherAI claims Comma v0.1-1T and Comma v0.1-2T are evidence that the Common Pile v0.1 was curated carefully enough to enable developers to build models competitive with proprietary alternatives. According to EleutherAI, the models, both of which are 7 billion parameters in size and were trained on only a fraction of the Common Pile v0.1, rival models like Meta’s first Llama AI model on benchmarks for coding, image understanding, and math.

Parameters, sometimes called weights, are the internal components of an AI model that guide its behavior and answers.

“In general, we think that the common idea that unlicensed text drives performance is unjustified,” Biderman wrote in her post. “As the amount of accessible openly licensed and public domain data grows, we can expect the quality of models trained on openly licensed content to improve.”

The Common Pile v0.1 appears to be in part an effort to right EleutherAI’s historical wrongs. Years ago, the company released The Pile, an open collection of training text that includes copyrighted material. AI companies have come under fire, and legal pressure, for using The Pile to train models.

EleutherAI is committing to releasing open datasets more frequently going forward in collaboration with its research and infrastructure partners.

Updated 9:48 a.m. Pacific: Biderman clarified in a post on X that EleutherAI contributed to the release of the datasets and models, but that their development involved many partners, including the University of Toronto, which helped lead the research.

