
Pile of Law

256 GB of legal text — court opinions, contracts, statutes, and regulatory filings.

LQS 84 · gold · ⚠ Research-only · 10M legal documents · 256 GB · JSONL · Released 2022


Source: huggingface.co · maintained by Stanford CRFM (Henderson et al.)

About this dataset

Pile of Law is a 256 GB corpus of English legal and administrative text from Stanford CRFM. It includes federal and state court opinions, SEC filings, administrative agency rulings, the Code of Federal Regulations, state statutes, and dozens of other sources. It is designed as a domain-specific pretraining corpus for legal language models.

Maintainer

Stanford CRFM (Henderson et al.)

License

CC BY-NC-SA 4.0

Formats

JSONL

Paper

Read on arxiv.org →

LabelSets Quality Score

LQS is our 7-dimension quality score, computed from the dataset's published statistics.

84 out of 100

gold tier

Solid dataset with some trade-offs

Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.

Completeness 90

No public completeness metric; using prior for 'governmental' datasets.

Uniqueness 85

Near-duplicate filtering (MinHash / LSH / SimHash).

Validation 70

Unlabeled corpus — validation limited to format integrity.

Size adequacy 100

10,000,000 documents — exceeds 5,000 adequacy target for Legal.

Format compliance 95

Industry-standard format — drop-in compatible with mainstream tooling.

Label density 0

Unlabeled corpus — label density not applicable.

Class balance 60

Unlabeled corpus — class balance not applicable.
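
The uniqueness dimension above names MinHash as one near-duplicate technique. The following is a minimal sketch of that idea, not the pipeline used by the maintainers or by LQS: the shingle size, the number of hash functions, the function names, and the sample sentences are all illustrative assumptions.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles in a text (lowercased)."""
    words = text.lower().split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items, num_hashes=64):
    """For each salted hash function, keep the minimum hash value over the set."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions, which estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Hypothetical sample texts: doc_b is a near-duplicate of doc_a, doc_c is unrelated.
doc_a = "the court grants the motion to dismiss the complaint with prejudice"
doc_b = "the court grants the motion to dismiss the complaint without prejudice"
doc_c = "annual report filed by the registrant for the fiscal year"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
sim_aa = estimated_jaccard(sig_a, sig_a)
sim_ab = estimated_jaccard(sig_a, sig_b)
sim_ac = estimated_jaccard(sig_a, sig_c)
```

In a sketch like this, pairs whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates. At corpus scale the signatures are usually bucketed with LSH rather than compared pairwise.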

What it's used for

Common tasks and benchmarks where Pile of Law is the default or competitive choice.

Legal LLM pretraining
Legal domain adaptation
Retrieval over case law

Sample statistics

What's actually in the dataset — from the maintainer's published stats.

256 GB of text across 30+ sources, including court opinions (federal and state), SEC filings, regulations (CFR), state statutes, tax rulings, and patents.
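
Because the corpus is distributed as JSONL (one JSON object per line), it can be processed one record at a time without loading a whole file into memory. The sketch below illustrates that pattern with the standard library only. The `text` and `url` field names, the helper names, and the sample records are assumptions made for illustration; check the actual schema on the dataset page before relying on them.

```python
import io
import json
from collections import Counter
from urllib.parse import urlparse


def iter_jsonl(stream):
    """Yield one parsed record per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(records):
    """Count documents and total characters per source host."""
    docs = Counter()
    chars = Counter()
    for record in records:
        # Assumed schema: each record carries a "url" and a "text" field.
        host = urlparse(record.get("url", "")).netloc or "unknown"
        docs[host] += 1
        chars[host] += len(record.get("text", ""))
    return docs, chars


# Hypothetical sample standing in for a real shard of the corpus.
SAMPLE = io.StringIO(
    '{"text": "The court finds for the plaintiff.", "url": "https://example.org/opinion/1"}\n'
    '{"text": "Section 1. Definitions.", "url": "https://example.org/statute/7"}\n'
    '\n'
    '{"text": "Annual report.", "url": "https://example.com/filing/3"}\n'
)

docs, chars = summarize(iter_jsonl(SAMPLE))
```

For a real shard, the `io.StringIO` sample would be replaced by an open file handle, for example `open(path, encoding="utf-8")`, and the generator keeps memory use flat regardless of file size.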

License

Pile of Law is distributed under CC BY-NC-SA 4.0. This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify current license terms with the maintainer before commercial use.

Heads up: this dataset's license restricts commercial use. If you need legal data for production, check LabelSets' paid datasets below — every listing has an explicit commercial license.

Need commercial-licensed Legal data?

LabelSets sellers offer paid legal datasets with what public datasets often can't give you:

Explicit commercial license in writing
LQS-verified quality for your specific use case
Instant download — no DUA, credentialed access, or research gating
PII scanned, deduplicated, and production-ready


Frequently Asked Questions

Can I use Pile of Law commercially?

Pile of Law is distributed under CC BY-NC-SA 4.0, which restricts commercial use. For a commercially licensed alternative in the legal domain, see LabelSets' paid datasets.

How large is Pile of Law?

Pile of Law contains 10,000,000 legal documents, totaling 256 GB of text across 30+ sources, including court opinions (federal and state), SEC filings, regulations (CFR), state statutes, tax rulings, and patents.

Who maintains Pile of Law, and where is it available?

Pile of Law is maintained by Stanford CRFM (Henderson et al.) and is available at https://huggingface.co/datasets/pile-of-law/pile-of-law. LabelSets indexes and scores this dataset for discoverability but does not redistribute it.

What is the LabelSets Quality Score (LQS)?

LQS is a 7-dimension quality score (completeness, uniqueness, validation, size adequacy, format compliance, label density, class balance) computed from the dataset's published statistics. Composite scores map to tiers: platinum (≥90), gold (≥75), silver (≥60), bronze (<60).
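
The tier thresholds above can be written as a small lookup. This is a sketch of the stated mapping only; the function name `lqs_tier` is hypothetical, and how the seven dimensions are weighted into the composite score is not described here, so it is not modeled.

```python
def lqs_tier(score: float) -> str:
    """Map a composite LQS (0 to 100) to its tier using the published thresholds."""
    if not 0 <= score <= 100:
        raise ValueError("LQS must be between 0 and 100")
    if score >= 90:
        return "platinum"
    if score >= 75:
        return "gold"
    if score >= 60:
        return "silver"
    return "bronze"


# Pile of Law's composite score as listed on this page.
pile_of_law_tier = lqs_tier(84)
```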
