825 GB diverse English text corpus for LLM pretraining, assembled from 22 high-quality sources.
The Pile is EleutherAI's 825 GB English text corpus built from 22 high-quality sub-datasets: academic papers, code, books, web text, patents, StackExchange, etc. It was designed as a diverse pretraining corpus for large language models and has been used to train GPT-Neo, GPT-J, and related open-source LLMs.
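The Pile is commonly distributed as jsonlines files in which each record carries the document text plus a `meta` object naming the sub-dataset it came from. A minimal sketch of iterating such records, assuming the `text` / `meta.pile_set_name` schema (verify against the files you actually download):

```python
import json

# In-memory stand-ins for lines read from a Pile-format jsonlines file.
# The field names below are the assumed schema, not guaranteed by this page.
sample_lines = [
    '{"text": "def add(a, b):\\n    return a + b", "meta": {"pile_set_name": "Github"}}',
    '{"text": "A method for treating...", "meta": {"pile_set_name": "USPTO Backgrounds"}}',
]

def iter_pile_records(lines):
    """Yield (sub_dataset_name, document_text) pairs from Pile-format jsonlines."""
    for line in lines:
        record = json.loads(line)
        yield record["meta"]["pile_set_name"], record["text"]

for source, text in iter_pile_records(sample_lines):
    print(source, len(text))
```

In practice the files are zstd-compressed, so the same loop would wrap a streaming decompressor around the file handle rather than reading raw lines.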
LQS is our 7-dimension quality score, computed from the dataset's published statistics. See methodology →
Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.
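As an illustration of how the seven dimensions roll up into one number, here is a minimal sketch assuming equal weights; the actual LQS weighting is not specified on this page, and the dimension names are taken verbatim from the list above:

```python
# Hypothetical composite-score sketch. Equal weighting is an assumption;
# the published LQS methodology may weight dimensions differently.
DIMENSIONS = (
    "completeness", "uniqueness", "validation_health", "size_adequacy",
    "format_compliance", "label_density", "class_balance",
)

def composite_score(scores: dict) -> float:
    """Average the 7 per-dimension scores into one composite value."""
    missing = set(DIMENSIONS) - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)

example = {d: 80.0 for d in DIMENSIONS}
print(composite_score(example))  # 80.0
```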
Common tasks and benchmarks where The Pile is the default or competitive choice.
What's actually in the dataset — from the maintainer's published stats.
The Pile is distributed under MIT (mixed sub-licenses). This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify current license terms with the maintainer before commercial use.
LabelSets sellers offer paid NLP / Text datasets with what public datasets often can't give you:
Other entries in the NLP / Text catalog.