💬 Curated Catalog · NLP / Text

Common Crawl

250+ billion web pages (petabytes) — the raw material behind most LLM pretraining.

LQS 69 · silver tier · ✓ Commercial OK · 250B web pages · 8,500 TB · WARC/WET · First released 2008
Source: commoncrawl.org · maintained by Common Crawl Foundation

About this dataset

Common Crawl is a nonprofit that publishes monthly web crawls. The cumulative archive holds 250B+ web pages across multiple petabytes of WARC files. Virtually every large language model — GPT, LLaMA, PaLM — is pretrained on some filtered/cleaned subset of Common Crawl (C4, RefinedWeb, FineWeb, etc.).

Formats
WARC · WET · WAT
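WET files hold the plain text extracted from each crawled page, stored as a sequence of WARC-style records (headers, a blank line, then the payload). As a rough sketch, a single record can be split into its header fields and text like this; the sample record below is synthetic, and real WET files are gzipped concatenations of many such records:

```python
# Synthetic single WET record for illustration (real files are gzipped
# streams of many records, one per extracted page).
SAMPLE_WET = (
    "WARC/1.0\r\n"
    "WARC-Type: conversion\r\n"
    "WARC-Target-URI: http://example.com/\r\n"
    "Content-Length: 13\r\n"
    "\r\n"
    "Hello, crawl!"
)

def parse_record(raw: str):
    """Split one WET record into (version, header fields, text payload)."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    version = lines[0]                                   # e.g. "WARC/1.0"
    fields = dict(line.split(": ", 1) for line in lines[1:])
    length = int(fields["Content-Length"])
    return version, fields, body[:length]

version, fields, text = parse_record(SAMPLE_WET)
print(fields["WARC-Target-URI"], text)
```

For real workloads, a dedicated WARC reader library is the practical choice rather than hand-parsing; this sketch only shows the record shape.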

LabelSets Quality Score

LQS is our 7-dimension quality score, computed from the dataset's published statistics. See methodology →

69 / 100 · silver tier

Useful scale, quality trade-offs to review

Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.

Completeness (65): No public completeness metric; using the prior for 'raw_web_crawl' datasets.
Uniqueness (42): Raw web crawl — substantial duplication expected (~30%+ typical).
Validation (70): Unlabeled corpus — validation limited to format integrity.
Size adequacy (100): 250,000,000,000 documents — exceeds the 100,000-document adequacy target for NLP / Text.
Format compliance (82): Custom format, documented but non-standard.
Label density (0): Unlabeled corpus — label density not applicable.
Class balance (60): Unlabeled corpus — class balance not applicable.

What it's used for

Common tasks and benchmarks where Common Crawl is the default or competitive choice.

Sample statistics

What's actually in the dataset — from the maintainer's published stats.

250B+ web pages in the cumulative archive. Monthly crawls add 3–5B pages (~250 TB per crawl). Filtered downstream corpora (C4, RefinedWeb) are the practical entry point.
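Each monthly crawl is identified by a CC-MAIN-&lt;year&gt;-&lt;week&gt; label, and its file listings (e.g. the gzipped list of WET paths) live under a predictable path on Common Crawl's data host. A minimal sketch of building that URL, assuming the current layout — verify against the maintainer's "Get Started" page before relying on it:

```python
# Common Crawl's public data host; crawl listings live under crawl-data/.
BASE = "https://data.commoncrawl.org/crawl-data"

def wet_paths_url(crawl_id: str) -> str:
    """URL of the gzipped listing of WET file paths for one monthly crawl."""
    return f"{BASE}/{crawl_id}/wet.paths.gz"

# "CC-MAIN-2024-10" is an example crawl label, not a recommendation.
print(wet_paths_url("CC-MAIN-2024-10"))
```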

License

Common Crawl is distributed under its Terms of Use, which permit redistribution with attribution. This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify the current license terms with the maintainer before commercial use.

Need commercial-licensed NLP / Text data?

LabelSets sellers offer paid NLP / Text datasets with guarantees that public datasets often can't provide.

Browse paid NLP / Text → Sell your dataset

Similar public datasets

Other entries in the NLP / Text catalog.

Frequently Asked Questions

Can Common Crawl be used commercially?
Common Crawl is distributed under its Terms of Use (redistribution with attribution), which generally permit commercial use. Always verify the current license terms with the maintainer (Common Crawl Foundation) before using it in a commercial product.

How large is Common Crawl?
The cumulative archive holds 250B+ web pages; monthly crawls add 3–5B pages (~250 TB per crawl). Filtered downstream corpora (C4, RefinedWeb) are the practical entry point.

Where is Common Crawl available?
Common Crawl is maintained by the Common Crawl Foundation and is available at https://commoncrawl.org/get-started. LabelSets indexes and scores this dataset for discoverability but does not redistribute it.

What is LQS?
LQS is a 7-dimension quality score (completeness, uniqueness, validation, size adequacy, format compliance, label density, class balance) computed from the dataset's published statistics. Composite scores map to tiers: platinum (≥90), gold (≥75), silver (≥60), bronze (<60). Read the full methodology.