💬 Curated Catalog · NLP / Text

Wikipedia (English Dump)

6.8M+ English articles — the most-used clean text corpus for pretraining and retrieval.

LQS 83 (gold) · ✓ Commercial OK · 6.8M articles · 22 GB · XML · JSON · First released 2001
Source: dumps.wikimedia.org · maintained by Wikimedia Foundation
Key stats: 6.8M articles · 22 GB on disk · LQS 83 (gold tier) · First released 2001

About this dataset

The English Wikipedia XML dump is the single most widely-used clean text corpus in NLP. Updated monthly by the Wikimedia Foundation, it provides 6.8M+ articles with structured metadata (links, categories, infoboxes) and is a canonical input for pretraining LLMs, retrieval-augmented generation, and entity linking.
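Because the dump is plain MediaWiki export XML, it can be streamed without ever loading the full file into memory. The following is a minimal parsing sketch, assuming a locally downloaded copy of the standard pages-articles dump, the usual <page>/<title>/<revision>/<text> layout, and Python 3.8+ for the `{*}` namespace wildcard; the file name is illustrative, not prescriptive.

```python
# Minimal sketch: stream pages out of a local pages-articles dump.
# Assumptions: the conventional file name below and the standard MediaWiki
# export layout (<page>/<title>/<revision>/<text>).
import bz2
import xml.etree.ElementTree as ET

DUMP_PATH = "enwiki-latest-pages-articles.xml.bz2"  # assumed local copy

def iter_articles(path):
    """Yield (title, wikitext) pairs without loading the 22 GB dump into memory."""
    with bz2.open(path, "rb") as f:
        for _event, elem in ET.iterparse(f, events=("end",)):
            if elem.tag.endswith("}page"):  # tags carry the export namespace
                title = elem.findtext("{*}title", default="")
                text = elem.findtext("{*}revision/{*}text", default="")
                yield title, text
                elem.clear()  # drop parsed content to keep memory flat

if __name__ == "__main__":
    for i, (title, text) in enumerate(iter_articles(DUMP_PATH)):
        print(title, len(text))
        if i >= 4:  # peek at the first few pages only
            break
```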

License: CC BY-SA 4.0
Formats: XML · JSON

LabelSets Quality Score

LQS is our 7-dimension quality score, computed from the dataset's published statistics. See methodology →

83 out of 100 · gold tier · Solid dataset with some trade-offs

Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.

Completeness: 88 · No public completeness metric; using prior for 'crowdsourced_qc' datasets.
Uniqueness: 93 · Exact-hash deduplication documented by maintainer.
Validation: 70 · Unlabeled corpus; validation limited to format integrity.
Size adequacy: 95 · 6,800,000 articles, exceeding the 100,000 adequacy target for NLP / Text.
Format compliance: 88 · Custom format with published schema documentation.
Label density: 0 · Unlabeled corpus; label density not applicable.
Class balance: 60 · Unlabeled corpus; class balance not applicable.
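The tier thresholds quoted in the FAQ below can be expressed directly in code. This sketch only reproduces the score-to-tier mapping; how the seven dimension scores above are weighted into the composite of 83 is not published on this page, so no weighting scheme is assumed.

```python
# Sketch of the LQS score-to-tier mapping described in the FAQ
# (platinum >= 90, gold >= 75, silver >= 60, bronze < 60).
# The weighting behind the composite score itself is not assumed here.

def lqs_tier(composite: float) -> str:
    """Map a composite LQS (0-100) to its tier."""
    if composite >= 90:
        return "platinum"
    if composite >= 75:
        return "gold"
    if composite >= 60:
        return "silver"
    return "bronze"

print(lqs_tier(83))  # -> "gold", matching this dataset's badge
```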

What it's used for

Common tasks and benchmarks where Wikipedia (English Dump) is the default or competitive choice, including pretraining LLMs, retrieval-augmented generation, and entity linking.

Sample statistics

What's actually in the dataset — from the maintainer's published stats.

6.8M+ English articles, ~4.5B words, 22 GB compressed XML. Updated monthly. Structured metadata: links, categories, infoboxes, citations.
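For reference, the compressed dump is typically fetched directly from dumps.wikimedia.org. The snippet below is a hedged sketch using a streaming HTTP request so the ~22 GB file is written in chunks; the exact "latest" file name is an assumption taken from the conventional dump layout and should be checked against the source directory.

```python
# Hedged sketch: stream-download the compressed dump in 1 MiB chunks so the
# ~22 GB file never sits in memory. The URL below assumes the conventional
# "latest" dump layout; verify the current file name on dumps.wikimedia.org.
import requests

URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"

def download(url: str, dest: str) -> None:
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as out:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                out.write(chunk)

if __name__ == "__main__":
    download(URL, "enwiki-latest-pages-articles.xml.bz2")
```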

License

Wikipedia (English Dump) is distributed under CC BY-SA 4.0. This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify current license terms with the maintainer before commercial use.

Need commercial-licensed NLP / Text data?

LabelSets sellers offer paid NLP / Text datasets with what public datasets often can't give you.


Similar public datasets

Other entries in the NLP / Text catalog.

Frequently Asked Questions

Can I use Wikipedia (English Dump) commercially?
Wikipedia (English Dump) is distributed under CC BY-SA 4.0, which generally permits commercial use. Always verify the current license terms with the maintainer (Wikimedia Foundation) before using it in a commercial product.

How large is Wikipedia (English Dump)?
Wikipedia (English Dump) contains 6.8M+ English articles (~4.5B words, 22 GB compressed XML), updated monthly, with structured metadata: links, categories, infoboxes, citations.

Who maintains Wikipedia (English Dump) and where can I get it?
Wikipedia (English Dump) is maintained by Wikimedia Foundation and is available at https://dumps.wikimedia.org/enwiki/. LabelSets indexes and scores this dataset for discoverability but does not redistribute it.

What is the LabelSets Quality Score (LQS)?
LQS is a 7-dimension quality score (completeness, uniqueness, validation, size adequacy, format compliance, label density, class balance) computed from the dataset's published statistics. Composite scores map to tiers: platinum (≥90), gold (≥75), silver (≥60), bronze (<60). Read the full methodology.