Pile of Law

256 GB of legal text — court opinions, contracts, statutes, and regulatory filings.

LQS 84 · gold · ⚠ Research-only · 10M legal documents · 256 GB · JSONL · Released 2022
Source: huggingface.co · maintained by Stanford CRFM (Henderson et al.)

About this dataset

Pile of Law is a 256 GB corpus of English legal and administrative text from Stanford CRFM. It includes federal and state court opinions, SEC filings, administrative agency rulings, the Code of Federal Regulations, state statutes, and dozens of other sources. It was designed as a domain-specific pretraining corpus for legal language models.

License: CC BY-NC-SA 4.0
Formats: JSONL

LabelSets Quality Score

LQS is our 7-dimension quality score, computed from the dataset's published statistics. See methodology →

84 out of 100 · gold tier

Solid dataset with some trade-offs

Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.

Completeness: 90. No public completeness metric; using prior for 'governmental' datasets.
Uniqueness: 85. Near-duplicate filtering (MinHash / LSH / SimHash).
Validation: 70. Unlabeled corpus; validation limited to format integrity.
Size adequacy: 100. 10,000,000 documents, which exceeds the 5,000 adequacy target for Legal.
Format compliance: 95. Industry-standard format; drop-in compatible with mainstream tooling.
Label density: 0. Unlabeled corpus; label density not applicable.
Class balance: 60. Unlabeled corpus; class balance not applicable.
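
The uniqueness entry above names MinHash as one near-duplicate filtering technique. The listing does not say which method or parameters were actually applied to Pile of Law, so the following is only a minimal illustration of the general idea. The function names, the 3-word shingle size, and the 64-hash signature length are all assumptions chosen for the example.

```python
import hashlib


def shingles(text, k=3):
    """Split text into a set of overlapping k-word shingles (k is an assumed value)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def _hash(seed, item):
    """Deterministic 64-bit hash of an item, salted with a seed."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all items."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the court held that the contract was void for lack of consideration"
doc_b = "the court held that the contract was void for want of consideration"
sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
print(f"estimated similarity: {estimate_jaccard(sig_a, sig_b):.2f}")
```

Pairs whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates. At corpus scale the signatures are usually bucketed with LSH rather than compared pairwise.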

Sample statistics

What's actually in the dataset — from the maintainer's published stats.

256 GB of text across 30+ sources, including court opinions (federal and state), SEC filings, regulations (CFR), state statutes, tax rulings, and patents.
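
Because the corpus is distributed as JSONL (one JSON object per line), it can be processed as a stream without loading 256 GB into memory. The sketch below shows one way to do that with the standard library. The helper names `iter_records` and `total_text_chars` are hypothetical, and the `text` field name is an assumption; check the dataset card for the actual schema and file layout.

```python
import io
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per JSONL line; blank, malformed, and non-object lines are skipped."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skipped silently here; a real pipeline would likely log this
        if isinstance(record, dict):
            yield record


def total_text_chars(lines: Iterable[str], text_field: str = "text") -> int:
    """Sum the character length of the (assumed) text field across all records."""
    return sum(len(str(r.get(text_field) or "")) for r in iter_records(lines))


# Tiny in-memory stand-in for a JSONL file; an open file handle works the same way.
sample = io.StringIO(
    '{"text": "Section 1. Definitions."}\n'
    "\n"
    "not json\n"
    '{"text": "Held: affirmed."}\n'
)
print(total_text_chars(sample))
```

Passing an open file object (or a decompressing reader, if the files are compressed) in place of `sample` keeps memory use constant regardless of file size.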

License

Pile of Law is distributed under CC BY-NC-SA 4.0. This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify current license terms with the maintainer before commercial use.

Heads up: this dataset's license restricts commercial use. If you need legal data for production, check LabelSets' paid datasets below — every listing has an explicit commercial license.

Frequently Asked Questions

Pile of Law is distributed under CC BY-NC-SA 4.0, which restricts commercial use. For a commercially-licensed alternative in legal, see LabelSets' paid datasets.
Pile of Law contains 10,000,000 legal documents, totaling 256 GB of text across 30+ sources, including court opinions (federal and state), SEC filings, regulations (CFR), state statutes, tax rulings, and patents.
Pile of Law is maintained by Stanford CRFM (Henderson et al.) and is available at https://huggingface.co/datasets/pile-of-law/pile-of-law. LabelSets indexes and scores this dataset for discoverability but does not redistribute it.
LQS is a 7-dimension quality score (completeness, uniqueness, validation, size adequacy, format compliance, label density, class balance) computed from the dataset's published statistics. Composite scores map to tiers: platinum (≥90), gold (≥75), silver (≥60), bronze (<60). Read the full methodology.
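
The tier thresholds quoted above can be written as a small lookup. The weighting that turns the seven dimension scores into the composite is not given on this page, so this sketch covers only the score-to-tier step. The function name `lqs_tier` is hypothetical.

```python
def lqs_tier(score: float) -> str:
    """Map a composite LQS score (0 to 100) to its tier using the published thresholds."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= 90:
        return "platinum"
    if score >= 75:
        return "gold"
    if score >= 60:
        return "silver"
    return "bronze"


print(lqs_tier(84))
```

With Pile of Law's composite score of 84, the lookup returns the gold tier shown in the listing.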