💬 Curated Catalog · NLP / Text

The Pile

825 GB diverse English text corpus for LLM pretraining, assembled from 22 high-quality sources.

LQS 80 (gold) · ⚠ Research-only · 300M documents · 825 GB · JSONL · First released 2021
Source: pile.eleuther.ai · maintained by EleutherAI

About this dataset

The Pile is EleutherAI's 825 GB English text corpus built from 22 high-quality sub-datasets: academic papers, code, books, web text, patents, StackExchange, etc. It was designed as a diverse pretraining corpus for large language models and has been used to train GPT-Neo, GPT-J, and related open-source LLMs.
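The Pile ships as JSON Lines: one JSON object per document (distributed shards are zstd-compressed `.jsonl.zst` files), with the document body under `text` and the source sub-dataset under `meta.pile_set_name`. A minimal sketch of iterating already-decompressed lines, assuming that published schema:

```python
import json
from collections import Counter

def iter_pile_docs(lines):
    """Yield (text, subset_name) pairs from Pile-style JSONL lines.

    Assumes the published schema: the document body under "text" and the
    sub-dataset name under "meta.pile_set_name".
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield record["text"], record.get("meta", {}).get("pile_set_name", "unknown")

# Tally documents per sub-dataset from a small in-memory sample.
sample = [
    '{"text": "Example abstract...", "meta": {"pile_set_name": "PubMed Abstracts"}}',
    '{"text": "def main(): pass", "meta": {"pile_set_name": "Github"}}',
]
counts = Counter(subset for _, subset in iter_pile_docs(sample))
```

In practice you would wrap the file handle with a zstd decompressor before feeding lines to the iterator; the per-line parsing is unchanged.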

Maintainer: EleutherAI · Formats: JSONL

LabelSets Quality Score

LQS is our 7-dimension quality score, computed from the dataset's published statistics. See methodology →

80 out of 100 · gold tier

Solid dataset with some trade-offs

Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.

Completeness 72
No public completeness metric; using prior for 'web_scrape' datasets.
Uniqueness 85
Near-duplicate filtering (MinHash / LSH / SimHash).
Validation 70
Unlabeled corpus — validation limited to format integrity.
Size adequacy 100
300,000,000 documents — exceeds 100,000 adequacy target for NLP / Text.
Format compliance 95
Industry-standard format — drop-in compatible with mainstream tooling.
Label density 0
Unlabeled corpus — label density not applicable.
Class balance 60
Unlabeled corpus — class balance not applicable.
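The exact LQS weighting is not published here, so the sketch below uses a plain (optionally weighted) mean as a stand-in; only the tier thresholds (platinum ≥90, gold ≥75, silver ≥60, bronze <60) come from the stated methodology:

```python
def lqs_composite(dims, weights=None):
    """Combine per-dimension scores (0-100) into a composite.

    The real LQS weighting is not published; this uses an equal-weight
    mean by default, with optional per-dimension weights.
    """
    if weights is None:
        weights = {name: 1.0 for name in dims}
    total = sum(weights[name] for name in dims)
    return sum(dims[name] * weights[name] for name in dims) / total

def lqs_tier(score):
    """Map a composite score to a tier per the published thresholds."""
    if score >= 90:
        return "platinum"
    if score >= 75:
        return "gold"
    if score >= 60:
        return "silver"
    return "bronze"

# The Pile's dimension scores, as listed above.
pile_dims = {
    "completeness": 72, "uniqueness": 85, "validation": 70,
    "size_adequacy": 100, "format_compliance": 95,
    "label_density": 0, "class_balance": 60,
}
```

Note that the equal-weight mean of these seven scores is about 68.9, not the listed composite of 80, so the real weighting clearly differs (for example, by down-weighting the not-applicable label dimensions for unlabeled corpora).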

What it's used for

Common tasks and benchmarks where The Pile is the default or competitive choice:

Pretraining open-source LLMs — GPT-Neo, GPT-J, and EleutherAI's Pythia suite were trained on it.
Language-model evaluation — perplexity / bits-per-byte on the held-out Pile test split is a standard benchmark.

Sample statistics

What's actually in the dataset — from the maintainer's published stats.

825 GB uncompressed text, ~300M documents across 22 sub-datasets (Pile-CC, PubMed, arXiv, GitHub, Books3, etc.). Note: Books3 subset later removed due to copyright concerns.
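The uniqueness score above credits MinHash-style near-duplicate filtering. As an illustration only (not The Pile's actual pipeline, which used its own tooling), a toy MinHash sketch over character shingles:

```python
import hashlib

def shingles(text, k=5):
    """Character k-shingles of a document."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash_signature(text, num_hashes=64):
    """MinHash signature: for each hash function, keep the minimum shingle hash.

    Seeding blake2b with an index stands in for a family of independent
    hash functions.
    """
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.blake2b(
                s.encode(), digest_size=8, salt=seed.to_bytes(8, "big")
            ).digest(), "big")
            for s in shingles(text)
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Two near-duplicate documents share most shingles, so their signatures agree
# in most slots; a dedup pass would flag pairs above a similarity threshold.
doc_a = "the pile is an 825 gb diverse english text corpus for language model pretraining"
doc_b = "the pile is an 800 gb diverse english text corpus for language model pretraining"
sig_a, sig_b = minhash_signature(doc_a), minhash_signature(doc_b)
```

At corpus scale, signatures are bucketed with locality-sensitive hashing (LSH) so that only candidate pairs, not all pairs, are compared.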

License

The Pile is distributed under MIT (mixed sub-licenses). This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify current license terms with the maintainer before commercial use.

Heads up: parts of this dataset are not cleared for commercial use. The Pile's own packaging is MIT, but several sub-datasets carry restrictive or unclear licenses. If you need NLP / Text data for production, check LabelSets' paid datasets below; every listing has an explicit commercial license.

Need commercial-licensed NLP / Text data?

LabelSets sellers offer paid NLP / Text datasets with what public corpora often can't provide: an explicit commercial license on every listing.

Browse paid NLP / Text → Sell your dataset

Similar public datasets

Other entries in the NLP / Text catalog.

Frequently Asked Questions

What license does The Pile use, and can I use it commercially?
The Pile is distributed under MIT with mixed sub-licenses, some of which restrict commercial use. For a commercially licensed alternative in NLP / Text, see LabelSets' paid datasets.

How large is The Pile?
The Pile contains roughly 300,000,000 documents: 825 GB of uncompressed text across 22 sub-datasets (Pile-CC, PubMed, arXiv, GitHub, Books3, etc.). Note: the Books3 subset was later removed due to copyright concerns.

Who maintains The Pile, and where can I get it?
The Pile is maintained by EleutherAI and is available at https://pile.eleuther.ai. LabelSets indexes and scores this dataset for discoverability but does not redistribute it.

What is the LabelSets Quality Score (LQS)?
LQS is a 7-dimension quality score (completeness, uniqueness, validation, size adequacy, format compliance, label density, class balance) computed from the dataset's published statistics. Composite scores map to tiers: platinum (≥90), gold (≥75), silver (≥60), bronze (<60). Read the full methodology.