💬 Curated Catalog · NLP / Text

Common Crawl

250+ billion web pages (petabytes) — the raw material behind most LLM pretraining.

LQS 69 · silver tier · ✓ Commercial OK · 250B web pages · 8,500 TB · WARC/WET · First released 2008
Source: commoncrawl.org · maintained by Common Crawl Foundation

About this dataset

Common Crawl is a nonprofit that publishes monthly web crawls. The cumulative archive holds 250B+ web pages across multiple petabytes of WARC files. Virtually every large language model — GPT, LLaMA, PaLM — is pretrained on some filtered/cleaned subset of Common Crawl (C4, RefinedWeb, FineWeb, etc.).

Formats
WARC · WET · WAT
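WET files hold the plain text extracted from each crawled page, stored as a sequence of WARC-style records (headers, a blank line, then the payload). As a rough sketch, a single record can be split into its header fields and text like this; the sample record below is synthetic, and real WET files are gzipped concatenations of many such records:

```python
# Synthetic single WET record for illustration (real files are gzipped
# streams of many records, one per extracted page).
SAMPLE_WET = (
    "WARC/1.0\r\n"
    "WARC-Type: conversion\r\n"
    "WARC-Target-URI: http://example.com/\r\n"
    "Content-Length: 13\r\n"
    "\r\n"
    "Hello, crawl!"
)

def parse_record(raw: str):
    """Split one WET record into (version, header fields, text payload)."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    version = lines[0]                                   # e.g. "WARC/1.0"
    fields = dict(line.split(": ", 1) for line in lines[1:])
    length = int(fields["Content-Length"])
    return version, fields, body[:length]

version, fields, text = parse_record(SAMPLE_WET)
print(fields["WARC-Target-URI"], text)
```

For real workloads, a dedicated WARC reader library is the practical choice rather than hand-parsing; this sketch only shows the record shape.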

LabelSets Quality Score

LQS is our 7-dimension quality score, computed from the dataset's published statistics. See methodology →

69 / 100 · silver tier

Useful scale, quality trade-offs to review

Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.

Completeness (65): No public completeness metric; using the prior for 'raw_web_crawl' datasets.
Uniqueness (42): Raw web crawl — substantial duplication expected (~30%+ typical).
Validation (70): Unlabeled corpus — validation limited to format integrity.
Size adequacy (100): 250,000,000,000 documents — exceeds the 100,000-document adequacy target for NLP / Text.
Format compliance (82): Custom format, documented but non-standard.
Label density (0): Unlabeled corpus — label density not applicable.
Class balance (60): Unlabeled corpus — class balance not applicable.

What it's used for

Common tasks and benchmarks where Common Crawl is the default or competitive choice.

Sample statistics

What's actually in the dataset — from the maintainer's published stats.

250B+ web pages in the cumulative archive. Monthly crawls add 3–5B pages (~250 TB per crawl). Filtered downstream corpora (C4, RefinedWeb) are the practical entry point.
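Each monthly crawl is identified by a CC-MAIN-&lt;year&gt;-&lt;week&gt; label, and its file listings (e.g. the gzipped list of WET paths) live under a predictable path on Common Crawl's data host. A minimal sketch of building that URL, assuming the current layout — verify against the maintainer's "Get Started" page before relying on it:

```python
# Common Crawl's public data host; crawl listings live under crawl-data/.
BASE = "https://data.commoncrawl.org/crawl-data"

def wet_paths_url(crawl_id: str) -> str:
    """URL of the gzipped listing of WET file paths for one monthly crawl."""
    return f"{BASE}/{crawl_id}/wet.paths.gz"

# "CC-MAIN-2024-10" is an example crawl label, not a recommendation.
print(wet_paths_url("CC-MAIN-2024-10"))
```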

License

Common Crawl is distributed under its Terms of Use, which permit redistribution with attribution. This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify the current license terms with the maintainer before commercial use.

Need commercial-licensed NLP / Text data?

LabelSets sellers offer paid NLP / Text datasets with guarantees that public datasets often can't provide.

Browse paid NLP / Text → Sell your dataset

Similar public datasets

Other entries in the NLP / Text catalog.

Frequently Asked Questions

Can Common Crawl be used commercially?
Common Crawl is distributed under its Terms of Use (redistribution with attribution), which generally permit commercial use. Always verify the current license terms with the maintainer (Common Crawl Foundation) before using it in a commercial product.

How large is Common Crawl?
The cumulative archive holds 250B+ web pages; monthly crawls add 3–5B pages (~250 TB per crawl). Filtered downstream corpora (C4, RefinedWeb) are the practical entry point.

Where is Common Crawl available?
Common Crawl is maintained by the Common Crawl Foundation and is available at https://commoncrawl.org/get-started. LabelSets indexes and scores this dataset for discoverability but does not redistribute it.

What is LQS?
LQS is a 7-dimension quality score (completeness, uniqueness, validation, size adequacy, format compliance, label density, class balance) computed from the dataset's published statistics. Composite scores map to tiers: platinum (≥90), gold (≥75), silver (≥60), bronze (<60). Read the full methodology.