💬 Curated Catalog · NLP / Text

The Pile

825 GB diverse English text corpus for LLM pretraining, assembled from 22 high-quality sources.

LQS 80 (gold) · ⚠ Research-only · 300M documents · 825 GB · JSONL · First released 2021
Source: pile.eleuther.ai · maintained by EleutherAI

About this dataset

The Pile is EleutherAI's 825 GB English text corpus built from 22 high-quality sub-datasets: academic papers, code, books, web text, patents, StackExchange, etc. It was designed as a diverse pretraining corpus for large language models and has been used to train GPT-Neo, GPT-J, and related open-source LLMs.
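The Pile ships as JSON Lines: one JSON object per document (distributed shards are zstd-compressed `.jsonl.zst` files), with the document body under `text` and the source sub-dataset under `meta.pile_set_name`. A minimal sketch of iterating already-decompressed lines, assuming that published schema:

```python
import json
from collections import Counter

def iter_pile_docs(lines):
    """Yield (text, subset_name) pairs from Pile-style JSONL lines.

    Assumes the published schema: the document body under "text" and the
    sub-dataset name under "meta.pile_set_name".
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield record["text"], record.get("meta", {}).get("pile_set_name", "unknown")

# Tally documents per sub-dataset from a small in-memory sample.
sample = [
    '{"text": "Example abstract...", "meta": {"pile_set_name": "PubMed Abstracts"}}',
    '{"text": "def main(): pass", "meta": {"pile_set_name": "Github"}}',
]
counts = Counter(subset for _, subset in iter_pile_docs(sample))
```

In practice you would wrap the file handle with a zstd decompressor before feeding lines to the iterator; the per-line parsing is unchanged.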

Maintainer: EleutherAI · Formats: JSONL

LabelSets Quality Score

LQS is our 7-dimension quality score, computed from the dataset's published statistics. See methodology →

80 out of 100 · gold tier

Solid dataset with some trade-offs

Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.

Completeness 72
No public completeness metric; using prior for 'web_scrape' datasets.
Uniqueness 85
Near-duplicate filtering (MinHash / LSH / SimHash).
Validation 70
Unlabeled corpus — validation limited to format integrity.
Size adequacy 100
300,000,000 documents — exceeds 100,000 adequacy target for NLP / Text.
Format compliance 95
Industry-standard format — drop-in compatible with mainstream tooling.
Label density 0
Unlabeled corpus — label density not applicable.
Class balance 60
Unlabeled corpus — class balance not applicable.
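The exact LQS weighting is not published here, so the sketch below uses a plain (optionally weighted) mean as a stand-in; only the tier thresholds (platinum ≥90, gold ≥75, silver ≥60, bronze <60) come from the stated methodology:

```python
def lqs_composite(dims, weights=None):
    """Combine per-dimension scores (0-100) into a composite.

    The real LQS weighting is not published; this uses an equal-weight
    mean by default, with optional per-dimension weights.
    """
    if weights is None:
        weights = {name: 1.0 for name in dims}
    total = sum(weights[name] for name in dims)
    return sum(dims[name] * weights[name] for name in dims) / total

def lqs_tier(score):
    """Map a composite score to a tier per the published thresholds."""
    if score >= 90:
        return "platinum"
    if score >= 75:
        return "gold"
    if score >= 60:
        return "silver"
    return "bronze"

# The Pile's dimension scores, as listed above.
pile_dims = {
    "completeness": 72, "uniqueness": 85, "validation": 70,
    "size_adequacy": 100, "format_compliance": 95,
    "label_density": 0, "class_balance": 60,
}
```

Note that the equal-weight mean of these seven scores is about 68.9, not the listed composite of 80, so the real weighting clearly differs (for example, by down-weighting the not-applicable label dimensions for unlabeled corpora).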

What it's used for

Common tasks and benchmarks where The Pile is the default or competitive choice:

Pretraining open-source LLMs — GPT-Neo, GPT-J, and EleutherAI's Pythia suite were trained on it.
Language-model evaluation — perplexity / bits-per-byte on the held-out Pile test split is a standard benchmark.

Sample statistics

What's actually in the dataset — from the maintainer's published stats.

825 GB uncompressed text, ~300M documents across 22 sub-datasets (Pile-CC, PubMed, arXiv, GitHub, Books3, etc.). Note: Books3 subset later removed due to copyright concerns.
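The uniqueness score above credits MinHash-style near-duplicate filtering. As an illustration only (not The Pile's actual pipeline, which used its own tooling), a toy MinHash sketch over character shingles:

```python
import hashlib

def shingles(text, k=5):
    """Character k-shingles of a document."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash_signature(text, num_hashes=64):
    """MinHash signature: for each hash function, keep the minimum shingle hash.

    Seeding blake2b with an index stands in for a family of independent
    hash functions.
    """
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.blake2b(
                s.encode(), digest_size=8, salt=seed.to_bytes(8, "big")
            ).digest(), "big")
            for s in shingles(text)
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Two near-duplicate documents share most shingles, so their signatures agree
# in most slots; a dedup pass would flag pairs above a similarity threshold.
doc_a = "the pile is an 825 gb diverse english text corpus for language model pretraining"
doc_b = "the pile is an 800 gb diverse english text corpus for language model pretraining"
sig_a, sig_b = minhash_signature(doc_a), minhash_signature(doc_b)
```

At corpus scale, signatures are bucketed with locality-sensitive hashing (LSH) so that only candidate pairs, not all pairs, are compared.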

License

The Pile is distributed under MIT (mixed sub-licenses). This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify current license terms with the maintainer before commercial use.

Heads up: parts of this dataset are not cleared for commercial use. The Pile's own packaging is MIT, but several sub-datasets carry restrictive or unclear licenses. If you need NLP / Text data for production, check LabelSets' paid datasets below; every listing has an explicit commercial license.

Need commercial-licensed NLP / Text data?

LabelSets sellers offer paid NLP / Text datasets with what public corpora often can't provide: an explicit commercial license on every listing.

Browse paid NLP / Text → Sell your dataset

Similar public datasets

Other entries in the NLP / Text catalog.

Frequently Asked Questions

What license does The Pile use, and can I use it commercially?
The Pile is distributed under MIT with mixed sub-licenses, some of which restrict commercial use. For a commercially licensed alternative in NLP / Text, see LabelSets' paid datasets.

How large is The Pile?
The Pile contains roughly 300,000,000 documents: 825 GB of uncompressed text across 22 sub-datasets (Pile-CC, PubMed, arXiv, GitHub, Books3, etc.). Note: the Books3 subset was later removed due to copyright concerns.

Who maintains The Pile, and where can I get it?
The Pile is maintained by EleutherAI and is available at https://pile.eleuther.ai. LabelSets indexes and scores this dataset for discoverability but does not redistribute it.

What is the LabelSets Quality Score (LQS)?
LQS is a 7-dimension quality score (completeness, uniqueness, validation, size adequacy, format compliance, label density, class balance) computed from the dataset's published statistics. Composite scores map to tiers: platinum (≥90), gold (≥75), silver (≥60), bronze (<60). Read the full methodology.