365M cleaned web documents (156B tokens) — the pretraining corpus behind T5.
C4 is the cleaned, deduplicated version of Common Crawl, released by Google Research as the pretraining corpus for T5. Its 365M documents (156B tokens) of English web text passed through aggressive quality filters: bad-words removal, sentence-level deduplication, language identification, and a minimum line length.
LQS is our 7-dimension quality score, computed from the dataset's published statistics. See methodology →
Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.
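A composite like this can be sketched as an average over the seven dimension scores. The real LQS weighting is not published on this page, so the equal weights and the 0–100 scale below are assumptions.

```python
# Minimal sketch of a 7-dimension composite score (equal weights assumed).
LQS_DIMENSIONS = (
    "completeness", "uniqueness", "validation_health", "size_adequacy",
    "format_compliance", "label_density", "class_balance",
)

def composite_score(scores: dict[str, float]) -> float:
    """Unweighted mean of the seven dimension scores, each expected in [0, 100]."""
    missing = [d for d in LQS_DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return sum(scores[d] for d in LQS_DIMENSIONS) / len(LQS_DIMENSIONS)
```

A weighted variant would simply replace the mean with a dot product against per-dimension weights summing to 1.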
Common tasks and benchmarks where C4 (Colossal Clean Crawled Corpus) is the default or competitive choice.
What's actually in the dataset — from the maintainer's published stats.
C4 (Colossal Clean Crawled Corpus) is distributed under ODC-BY. This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify current license terms with the maintainer before commercial use.
LabelSets sellers offer paid NLP / text datasets with guarantees that public datasets often can't provide:
Other entries in the NLP / Text catalog.