MS MARCO

8.8M passages and 1M anonymized Bing queries — the go-to benchmark for passage ranking.

LQS 84 · gold · ⚠ Research-only · 8.8M passages · 3.3 GB · TSV · JSON · Released 2016

Source: microsoft.github.io · maintained by Microsoft Research

About this dataset

MS MARCO (Microsoft MAchine Reading COmprehension) is a collection of real, anonymized Bing user queries paired with relevant passages. The passage ranking task has 8.8M passages and 1M queries; there are also QA, generation, and document ranking variants. It is widely used for training dense retrievers and rerankers.

Maintainer

Microsoft Research

License

MS MARCO License (research only)

Formats

TSV · JSON
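
The passage-ranking TSV files are simple enough to read with the standard library. The sketch below assumes the commonly distributed layout: a passage collection with `pid<TAB>passage`, a query file with `qid<TAB>query`, and TREC-style relevance judgments with `qid 0 pid relevance`. The sample rows are invented for illustration and are not dataset content, so check the column order against the files you actually download.

```python
import csv
import io
from collections import defaultdict

# Tiny in-memory stand-ins for the real files (hypothetical rows).
COLLECTION_TSV = "0\tThe first sample passage.\n1\tThe second sample passage.\n"
QUERIES_TSV = "100\twhat is a sample query\n"
QRELS_TSV = "100\t0\t1\t1\n"


def read_id_text(handle):
    """Read a two-column TSV of (integer id, text) into a dict."""
    # QUOTE_NONE keeps literal quote characters inside passages untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {int(row[0]): row[1] for row in reader if row}


def read_qrels(handle):
    """Read 'qid 0 pid relevance' lines into {qid: set of relevant pids}."""
    qrels = defaultdict(set)
    for line in handle:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        qid, _unused, pid, relevance = parts
        if int(relevance) > 0:
            qrels[int(qid)].add(int(pid))
    return dict(qrels)


passages = read_id_text(io.StringIO(COLLECTION_TSV))
queries = read_id_text(io.StringIO(QUERIES_TSV))
qrels = read_qrels(io.StringIO(QRELS_TSV))

# Print each query next to the passages judged relevant to it.
for qid, pids in qrels.items():
    for pid in sorted(pids):
        print(queries[qid], "->", passages[pid])
```

For files on disk, replace the `io.StringIO(...)` wrappers with `open(path, encoding="utf-8", newline="")`. With 8.8M passages, holding the whole collection in a dict takes several gigabytes of memory, so streaming the rows may be preferable.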

Paper

Read on arxiv.org →

LabelSets Quality Score

LQS is our 7-dimension quality score, computed from the dataset's published statistics. See methodology →

84 out of 100 · gold tier

Solid dataset with some trade-offs

Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.

Completeness 88

No public completeness metric; using prior for 'crowdsourced_qc' datasets.

Uniqueness 93

Exact-hash deduplication documented by maintainer.

Validation 82

Crowdsourced labels with quality-control protocol (redundancy, golden tests).

Size adequacy 96

8,800,000 passages; exceeds the 100,000 adequacy target for NLP / Text.

Format compliance 95

Industry-standard format — drop-in compatible with mainstream tooling.

Label density 42

Average 0.1 labels per item (sparse).

Class balance 75

Moderate class skew — realistic production distribution.
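
This page does not state how the seven dimensions are weighted. As a rough sanity check only, the sketch below takes an unweighted mean of the scores listed above. Equal weighting is an assumption of this sketch, not the LQS methodology, and the dictionary keys are hypothetical names.

```python
from statistics import mean

# Dimension scores as listed on this page; key names are illustrative.
DIMENSIONS = {
    "completeness": 88,
    "uniqueness": 93,
    "validation": 82,
    "size_adequacy": 96,
    "format_compliance": 95,
    "label_density": 42,
    "class_balance": 75,
}


def unweighted_composite(scores):
    """Equal-weight mean of the dimension scores, rounded to one decimal."""
    return round(mean(scores.values()), 1)


print(unweighted_composite(DIMENSIONS))  # 81.6
```

The equal-weight mean comes to 81.6 rather than the published 84, which suggests the actual composite weights the dimensions unequally; the methodology page is the authority on that.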

What it's used for

Common tasks and benchmarks where MS MARCO is the default or competitive choice.

Passage ranking
Dense retrieval
Question answering
Document ranking

Sample statistics

What's actually in the dataset — from the maintainer's published stats.

8.8M passages, 1M+ queries, and ~533K relevance-labeled query-passage pairs in the passage ranking training set. The queries are real Bing queries, so they are highly representative of user intent.

License

MS MARCO is distributed under the MS MARCO License (research only). This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify current license terms with the maintainer before commercial use.

Heads up: this dataset's license restricts commercial use. If you need NLP / Text data for production, check LabelSets' paid datasets below; every listing has an explicit commercial license.

Need commercial-licensed NLP / Text data?

LabelSets sellers offer paid NLP / Text datasets with what public datasets often can't give you:

Explicit commercial license in writing
LQS-verified quality in your specific use-case
Instant download — no DUA, credentialed access, or research gating
PII scanned, deduplicated, and production-ready

Frequently Asked Questions

MS MARCO is distributed under the MS MARCO License (research only), which restricts commercial use. For a commercially licensed alternative in NLP / Text, see LabelSets' paid datasets.

MS MARCO contains 8,800,000 passages, 1M+ queries, and ~533K relevance-labeled query-passage pairs in the passage ranking training set. The queries are real Bing queries, so they are highly representative of user intent.

MS MARCO is maintained by Microsoft Research and is available at https://microsoft.github.io/msmarco/Datasets.html. LabelSets indexes and scores this dataset for discoverability but does not redistribute it.

LQS is a 7-dimension quality score (completeness, uniqueness, validation, size adequacy, format compliance, label density, class balance) computed from the dataset's published statistics. Composite scores map to tiers: platinum (≥90), gold (≥75), silver (≥60), bronze (<60). Read the full methodology.
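
The tier thresholds above translate directly into a lookup. This is a minimal sketch; the function name `lqs_tier` is hypothetical, and the range check assumes composite scores fall between 0 and 100.

```python
def lqs_tier(score: float) -> str:
    """Map a composite LQS score to its tier using the thresholds stated above."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= 90:
        return "platinum"
    if score >= 75:
        return "gold"
    if score >= 60:
        return "silver"
    return "bronze"


print(lqs_tier(84))  # gold
```

MS MARCO's composite of 84 sits six points below the platinum threshold.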
