C4 — Colossal Clean Crawled Corpus

365M cleaned web documents (156B tokens) — the pretraining corpus behind T5.

LQS 80 (gold) · Commercial OK · 365M documents · 305 GB · JSONL · Released 2020

Source: huggingface.co · maintained by Google Research (AllenAI hosts public copy)

About this dataset

C4 is a cleaned, deduplicated subset of Common Crawl released by Google Research as the pretraining corpus for T5. It contains 365M documents (156B tokens) of English web text that passed aggressive quality filters: bad-words removal, sentence-level deduplication, language identification, and a minimum line length.
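To make the filter list above concrete, here is a minimal sketch of three of the heuristics. It is an illustration only, not the maintainers' pipeline: the `BAD_WORDS` set, the `MIN_WORDS_PER_LINE` threshold, and the `clean_page` helper are all hypothetical, line-level deduplication stands in for the sentence-level step, and language identification is left out because it needs a third-party model.

```python
import re

# Hypothetical, tiny stand-in for a real bad-words list.
BAD_WORDS = {"badword"}
# Illustrative threshold only; not a documented C4 value.
MIN_WORDS_PER_LINE = 5


def clean_page(text, seen_lines):
    """Return the cleaned page text, or None if the page is rejected."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        # Minimum line length: drop short fragments such as menus and buttons.
        if len(line.split()) < MIN_WORDS_PER_LINE:
            continue
        # Deduplication: keep only the first occurrence of a line across pages.
        key = line.lower()
        if key in seen_lines:
            continue
        seen_lines.add(key)
        kept.append(line)
    cleaned = "\n".join(kept)
    # Bad-words removal: reject the whole page if any listed word appears.
    words = set(re.findall(r"[a-z']+", cleaned.lower()))
    if words & BAD_WORDS:
        return None
    # An empty page (everything filtered out) is also rejected.
    return cleaned or None
```

The `seen_lines` set is shared across calls, so a line that already appeared on an earlier page is dropped from later ones.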

Maintainer

Google Research (AllenAI hosts public copy)

License

ODC-BY

Formats

JSONL
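JSONL stores one JSON object per line, so a shard can be processed as a stream rather than loaded whole. The sketch below assumes gzip-compressed shards and a record schema of `text`, `timestamp`, and `url`; both are assumptions to check against the files you actually download, and `read_jsonl_gz` is a hypothetical helper name.

```python
import gzip
import io
import json


def read_jsonl_gz(fileobj):
    """Yield one dict per non-empty line from a gzip-compressed JSONL stream."""
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():
                yield json.loads(line)


# Build a tiny in-memory shard so the example runs without downloading anything.
# The field names ("text", "timestamp", "url") are an assumption about the schema.
sample = [
    {"text": "First document.", "timestamp": "2019-04-25T12:57:54Z", "url": "https://example.com/a"},
    {"text": "Second document.", "timestamp": "2019-04-26T08:00:00Z", "url": "https://example.com/b"},
]
raw = gzip.compress("\n".join(json.dumps(r) for r in sample).encode("utf-8"))

records = list(read_jsonl_gz(io.BytesIO(raw)))
print(len(records), records[0]["url"])
```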

Paper

Read on arxiv.org →

LabelSets Quality Score

LQS is our 7-dimension quality score, computed from the dataset's published statistics.

80 out of 100 · gold tier

Solid dataset with some trade-offs

Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.

Completeness 72

No public completeness metric; using prior for 'web_scrape' datasets.

Uniqueness 85

Sentence-level deduplication (repeated three-sentence spans dropped); no MinHash / LSH / SimHash near-duplicate pass.

Validation 70

Unlabeled corpus — validation limited to format integrity.

Size adequacy 100

365,000,000 documents — exceeds 100,000 adequacy target for NLP / Text.

Format compliance 95

Industry-standard format — drop-in compatible with mainstream tooling.

Label density 0

Unlabeled corpus — label density not applicable.

Class balance 60

Unlabeled corpus — class balance not applicable.
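The page does not say how the seven dimension scores combine into the composite of 80. As a rough check, the sketch below compares two unweighted means. Excluding label density (scored 0 as not applicable) is purely an assumption; the actual weighting is defined in the LQS methodology.

```python
# Dimension scores as listed on this page.
scores = {
    "completeness": 72,
    "uniqueness": 85,
    "validation": 70,
    "size_adequacy": 100,
    "format_compliance": 95,
    "label_density": 0,
    "class_balance": 60,
}

# Assumption: a dimension scored 0 as "not applicable" is left out of the mean.
# The real LQS weighting is defined in the methodology, not here.
EXCLUDED = {"label_density"}


def unweighted_composite(scores, excluded=EXCLUDED):
    """Plain average of every dimension that is not excluded."""
    used = [value for name, value in scores.items() if name not in excluded]
    return sum(used) / len(used)


all_seven = sum(scores.values()) / len(scores)
without_label_density = unweighted_composite(scores)
print(round(all_seven, 1), round(without_label_density, 1))  # prints 68.9 80.3
```

Under that assumption the six-dimension mean rounds to the published 80, while the plain seven-dimension mean does not.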

What it's used for

Common tasks and benchmarks where C4 — Colossal Clean Crawled Corpus is the default or competitive choice.

LLM pretraining
Masked language modeling
Span corruption

Sample statistics

What's actually in the dataset — from the maintainer's published stats.

365M documents, 156B tokens, 305 GB compressed. Aggressive filtering removed placeholder text, offensive content, and non-English pages.
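Those three totals imply rough per-document averages, which help when budgeting storage or sequence lengths. The short calculation below assumes "GB" means 10^9 bytes and treats the published figures as exact.

```python
DOCUMENTS = 365_000_000
TOKENS = 156_000_000_000
COMPRESSED_BYTES = 305 * 10**9  # assumption: "GB" means 10**9 bytes

tokens_per_doc = TOKENS / DOCUMENTS
bytes_per_doc = COMPRESSED_BYTES / DOCUMENTS

# prints: 427 tokens/doc, 836 compressed bytes/doc
print(f"{tokens_per_doc:.0f} tokens/doc, {bytes_per_doc:.0f} compressed bytes/doc")
```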

License

C4 — Colossal Clean Crawled Corpus is distributed under ODC-BY. This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify current license terms with the maintainer before commercial use.

Frequently Asked Questions

C4 (Colossal Clean Crawled Corpus) is distributed under ODC-BY, which generally permits commercial use. Always verify the current license terms with the maintainer, Google Research (AllenAI hosts the public copy), before using it in a commercial product.

C4 (Colossal Clean Crawled Corpus) contains 365,000,000 documents totaling 156B tokens, or 305 GB compressed. Aggressive filtering removed placeholder text, offensive content, and non-English pages.

C4 — Colossal Clean Crawled Corpus is maintained by Google Research (AllenAI hosts public copy) and is available at https://huggingface.co/datasets/allenai/c4. LabelSets indexes and scores this dataset for discoverability but does not redistribute it.
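For a first look at the data, streaming avoids downloading the full corpus. This is a minimal sketch, assuming the Hugging Face `datasets` package is installed and that the repository exposes an English config named `en`; the config name and the `stream_c4` helper are not stated on this page.

```python
from itertools import islice


def stream_c4(n=3, config="en"):
    """Return the first n records of allenai/c4 without a full download."""
    # Imported lazily so this module loads even when `datasets` is not installed.
    from datasets import load_dataset

    # streaming=True returns an iterable that fetches records on demand.
    dataset = load_dataset("allenai/c4", config, split="train", streaming=True)
    return list(islice(dataset, n))
```

Calling `stream_c4(2)` needs network access and returns a list of two record dicts.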

LQS is a 7-dimension quality score (completeness, uniqueness, validation, size adequacy, format compliance, label density, class balance) computed from the dataset's published statistics. Composite scores map to tiers: platinum (≥90), gold (≥75), silver (≥60), bronze (<60).
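The tier thresholds above translate directly into a lookup. The function name `lqs_tier` is hypothetical; only the cut-offs come from this page.

```python
def lqs_tier(score):
    """Map a composite LQS score (0-100) to its tier name."""
    if score >= 90:
        return "platinum"
    if score >= 75:
        return "gold"
    if score >= 60:
        return "silver"
    return "bronze"


print(lqs_tier(80))  # prints gold
```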