💬 Curated Catalog · NLP / Text

Wikipedia (English Dump)

6.8M+ English articles — the most-used clean text corpus for pretraining and retrieval.

LQS 83 (gold) · ✓ Commercial OK · 6.8M articles · 22 GB · XML · JSON · First released 2001
Source: dumps.wikimedia.org · maintained by Wikimedia Foundation
Key stats: 6.8M articles · 22 GB on disk · LQS 83 (gold tier) · First released 2001

About this dataset

The English Wikipedia XML dump is the single most widely-used clean text corpus in NLP. Updated monthly by the Wikimedia Foundation, it provides 6.8M+ articles with structured metadata (links, categories, infoboxes) and is a canonical input for pretraining LLMs, retrieval-augmented generation, and entity linking.
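Because the dump is plain MediaWiki export XML, it can be streamed without ever loading the full file into memory. The following is a minimal parsing sketch, assuming a locally downloaded copy of the standard pages-articles dump, the usual <page>/<title>/<revision>/<text> layout, and Python 3.8+ for the `{*}` namespace wildcard; the file name is illustrative, not prescriptive.

```python
# Minimal sketch: stream pages out of a local pages-articles dump.
# Assumptions: the conventional file name below and the standard MediaWiki
# export layout (<page>/<title>/<revision>/<text>).
import bz2
import xml.etree.ElementTree as ET

DUMP_PATH = "enwiki-latest-pages-articles.xml.bz2"  # assumed local copy

def iter_articles(path):
    """Yield (title, wikitext) pairs without loading the 22 GB dump into memory."""
    with bz2.open(path, "rb") as f:
        for _event, elem in ET.iterparse(f, events=("end",)):
            if elem.tag.endswith("}page"):  # tags carry the export namespace
                title = elem.findtext("{*}title", default="")
                text = elem.findtext("{*}revision/{*}text", default="")
                yield title, text
                elem.clear()  # drop parsed content to keep memory flat

if __name__ == "__main__":
    for i, (title, text) in enumerate(iter_articles(DUMP_PATH)):
        print(title, len(text))
        if i >= 4:  # peek at the first few pages only
            break
```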

License: CC BY-SA 4.0
Formats: XML · JSON

LabelSets Quality Score

LQS is our 7-dimension quality score, computed from the dataset's published statistics. See methodology →

83 out of 100 · gold tier · Solid dataset with some trade-offs

Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.

Completeness: 88 · No public completeness metric; using prior for 'crowdsourced_qc' datasets.
Uniqueness: 93 · Exact-hash deduplication documented by maintainer.
Validation: 70 · Unlabeled corpus; validation limited to format integrity.
Size adequacy: 95 · 6,800,000 articles, exceeding the 100,000 adequacy target for NLP / Text.
Format compliance: 88 · Custom format with published schema documentation.
Label density: 0 · Unlabeled corpus; label density not applicable.
Class balance: 60 · Unlabeled corpus; class balance not applicable.
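The tier thresholds quoted in the FAQ below can be expressed directly in code. This sketch only reproduces the score-to-tier mapping; how the seven dimension scores above are weighted into the composite of 83 is not published on this page, so no weighting scheme is assumed.

```python
# Sketch of the LQS score-to-tier mapping described in the FAQ
# (platinum >= 90, gold >= 75, silver >= 60, bronze < 60).
# The weighting behind the composite score itself is not assumed here.

def lqs_tier(composite: float) -> str:
    """Map a composite LQS (0-100) to its tier."""
    if composite >= 90:
        return "platinum"
    if composite >= 75:
        return "gold"
    if composite >= 60:
        return "silver"
    return "bronze"

print(lqs_tier(83))  # -> "gold", matching this dataset's badge
```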

What it's used for

Common tasks and benchmarks where Wikipedia (English Dump) is the default or competitive choice, including pretraining LLMs, retrieval-augmented generation, and entity linking.

Sample statistics

What's actually in the dataset — from the maintainer's published stats.

6.8M+ English articles, ~4.5B words, 22 GB compressed XML. Updated monthly. Structured metadata: links, categories, infoboxes, citations.
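For reference, the compressed dump is typically fetched directly from dumps.wikimedia.org. The snippet below is a hedged sketch using a streaming HTTP request so the ~22 GB file is written in chunks; the exact "latest" file name is an assumption taken from the conventional dump layout and should be checked against the source directory.

```python
# Hedged sketch: stream-download the compressed dump in 1 MiB chunks so the
# ~22 GB file never sits in memory. The URL below assumes the conventional
# "latest" dump layout; verify the current file name on dumps.wikimedia.org.
import requests

URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"

def download(url: str, dest: str) -> None:
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as out:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                out.write(chunk)

if __name__ == "__main__":
    download(URL, "enwiki-latest-pages-articles.xml.bz2")
```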

License

Wikipedia (English Dump) is distributed under CC BY-SA 4.0. This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify current license terms with the maintainer before commercial use.

Need commercial-licensed NLP / Text data?

LabelSets sellers offer paid NLP / Text datasets with what public datasets often can't give you.


Similar public datasets

Other entries in the NLP / Text catalog.

Frequently Asked Questions

Can I use Wikipedia (English Dump) commercially?
Wikipedia (English Dump) is distributed under CC BY-SA 4.0, which generally permits commercial use. Always verify the current license terms with the maintainer (Wikimedia Foundation) before using it in a commercial product.

How large is Wikipedia (English Dump)?
Wikipedia (English Dump) contains 6.8M+ English articles (~4.5B words, 22 GB compressed XML), updated monthly, with structured metadata: links, categories, infoboxes, citations.

Who maintains Wikipedia (English Dump) and where can I get it?
Wikipedia (English Dump) is maintained by Wikimedia Foundation and is available at https://dumps.wikimedia.org/enwiki/. LabelSets indexes and scores this dataset for discoverability but does not redistribute it.

What is the LabelSets Quality Score (LQS)?
LQS is a 7-dimension quality score (completeness, uniqueness, validation, size adequacy, format compliance, label density, class balance) computed from the dataset's published statistics. Composite scores map to tiers: platinum (≥90), gold (≥75), silver (≥60), bronze (<60). Read the full methodology.