💬 Curated Catalog · NLP / Text

MS MARCO

8.8M passages and 1M anonymized Bing queries — the go-to benchmark for passage ranking.

LQS 84 (gold) · ⚠ Research-only · 8.8M passages · 3.3 GB on disk · TSV, JSON · Released 2016
Source: microsoft.github.io · maintained by Microsoft Research

About this dataset

MS MARCO (Microsoft MAchine Reading COmprehension) is a collection of real, anonymized Bing user queries paired with relevant passages. The passage ranking task covers 8.8M passages and 1M queries; QA, generation, and document ranking variants also exist. It is widely used for training dense retrievers and rerankers.

Maintainer
Microsoft Research
Formats
TSV · JSON

LabelSets Quality Score

LQS is our 7-dimension quality score, computed from the dataset's published statistics.

84 out of 100 · gold tier

Solid dataset with some trade-offs

Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.

Completeness 88
No public completeness metric; the score uses the prior for 'crowdsourced_qc' datasets.
Uniqueness 93
Exact-hash deduplication documented by maintainer.
Validation 82
Crowdsourced labels with quality-control protocol (redundancy, golden tests).
Size adequacy 96
8,800,000 passages, well above the 100,000 adequacy target for NLP / Text.
Format compliance 95
Industry-standard format — drop-in compatible with mainstream tooling.
Label density 42
Average 0.1 labels per item (sparse).
Class balance 75
Moderate class skew — realistic production distribution.
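
As a rough sanity check on the composite, the seven scores above can be averaged directly. The weighting behind LQS is not stated on this page, so the sketch below assumes a plain unweighted mean; the function name `unweighted_mean` is illustrative only and is not the published formula.

```python
# Dimension scores exactly as listed on this page.
scores = {
    "completeness": 88,
    "uniqueness": 93,
    "validation": 82,
    "size_adequacy": 96,
    "format_compliance": 95,
    "label_density": 42,
    "class_balance": 75,
}


def unweighted_mean(values):
    """Plain arithmetic mean. An assumption, not the published LQS formula."""
    values = list(values)
    return sum(values) / len(values)


mean_score = unweighted_mean(scores.values())
print(f"unweighted mean: {mean_score:.1f}")  # prints "unweighted mean: 81.6"
```

The unweighted mean comes out near 81.6 rather than the published 84, which suggests the actual methodology does not weight the dimensions equally. Both values fall in the same tier.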


Sample statistics

What's actually in the dataset — from the maintainer's published stats.

8.8M passages, 1M+ queries, and ~533K labeled query-passage pairs for passage ranking training. The queries are real Bing searches, so they reflect actual user intent.
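
A minimal sketch of reading the passage ranking files in their TSV form. The column layouts are assumptions not stated on this page: two columns (id, text) for passages and queries, and four columns (query id, unused, passage id, relevance) for relevance judgments. The helper names are hypothetical and the sample rows are invented placeholders, not real MS MARCO data.

```python
import csv
import io
from collections import defaultdict


def read_id_text_tsv(handle):
    """Read two-column TSV rows (id, text) into a dict keyed by integer id."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {int(row[0]): row[1] for row in reader if len(row) >= 2}


def read_qrels(handle):
    """Read four-column rows (qid, unused, pid, relevance) into qid -> {pid}."""
    relevant = defaultdict(set)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip malformed lines
        qid, _, pid, rel = row
        if int(rel) > 0:
            relevant[int(qid)].add(int(pid))
    return dict(relevant)


# Placeholder rows standing in for the real files (assumed layout).
passages = read_id_text_tsv(io.StringIO("0\tplaceholder passage A\n1\tplaceholder passage B\n"))
queries = read_id_text_tsv(io.StringIO("10\tplaceholder query\n"))
qrels = read_qrels(io.StringIO("10\t0\t1\t1\n"))

# Join each query to the passages judged relevant for it.
for qid, pids in qrels.items():
    for pid in sorted(pids):
        print(queries[qid], "->", passages[pid])
```

For the real files, pass a handle from `open(path, encoding="utf-8", newline="")` instead of `io.StringIO`. At 8.8M rows, holding every passage in a dict takes several gigabytes of memory, so streaming or an on-disk index may be a better fit.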

License

MS MARCO is distributed under the MS MARCO License (research only). This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify current license terms with the maintainer before commercial use.

Heads up: this dataset's license restricts commercial use. If you need NLP / Text data for production, consider LabelSets' paid datasets, each of which carries an explicit commercial license.



Frequently Asked Questions

MS MARCO is distributed under the MS MARCO License (research only), which restricts commercial use. For a commercially licensed alternative in NLP / Text, see LabelSets' paid datasets.
MS MARCO contains 8,800,000 passages, 1M+ queries, and ~533K labeled query-passage pairs for passage ranking training. The queries are real Bing searches, so they reflect actual user intent.
MS MARCO is maintained by Microsoft Research and is available at https://microsoft.github.io/msmarco/Datasets.html. LabelSets indexes and scores this dataset for discoverability but does not redistribute it.
LQS is a 7-dimension quality score (completeness, uniqueness, validation, size adequacy, format compliance, label density, class balance) computed from the dataset's published statistics. Composite scores map to tiers: platinum (≥90), gold (≥75), silver (≥60), bronze (<60).
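
The tier thresholds above can be expressed as a small lookup. This is a sketch of the mapping as stated in this FAQ; the function name `lqs_tier` is hypothetical and not part of any LabelSets API.

```python
def lqs_tier(score: float) -> str:
    """Map a composite LQS score (0-100) to its tier name."""
    if score >= 90:
        return "platinum"
    if score >= 75:
        return "gold"
    if score >= 60:
        return "silver"
    return "bronze"


print(lqs_tier(84))  # prints "gold", matching the tier shown for MS MARCO
```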