Curated Catalog · Multimodal

MSR-VTT — Microsoft Video-to-Text

10K web video clips with 200K human-written captions — the standard video captioning benchmark.

LQS 87 · gold · ⚠ Research-only · 200K video-caption pairs · 7 GB (MP4 · JSON) · Released 2016
Source: cove.thecvf.com · maintained by Microsoft Research

About this dataset

MSR-VTT (MSR Video to Text) from Microsoft Research is the standard benchmark for video captioning and retrieval. 10,000 web video clips covering 20 categories, each paired with 20 human-written English captions (200,000 captions total). Widely used for video-language pretraining, video captioning, and cross-modal retrieval research.

Maintainer: Microsoft Research
Formats: MP4 · JSON

LabelSets Quality Score

LQS is our 7-dimension quality score, computed from the dataset's published statistics. See methodology →

87 out of 100 · gold tier

High-quality dataset across most dimensions

Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.

Completeness 88
No public completeness metric; using prior for 'crowdsourced_qc' datasets.
Uniqueness 90
Benchmark-grade splits with leakage prevention.
Validation 82
Crowdsourced labels with quality-control protocol (redundancy, golden tests).
Size adequacy 91
200,000 pairs — exceeds 100,000 adequacy target for Multimodal.
Format compliance 95
Industry-standard format — drop-in compatible with mainstream tooling.
Label density 93
Average 20.0 labels per item (high density).
Class balance 75
Moderate class skew — realistic production distribution.
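The composite and tier mapping can be sketched in a few lines. The per-dimension weighting is not published on this page, so the sketch below assumes an unweighted mean floored to an integer (which happens to reproduce the listed score of 87 for this dataset's dimension values); the tier thresholds are the ones stated in the methodology FAQ.

```python
# Hypothetical sketch of the LQS composite and tier mapping.
# Assumption: an unweighted mean of the 7 dimension scores, floored to
# an integer. The real weighting may differ — see the methodology page.

SCORES = {
    "completeness": 88,
    "uniqueness": 90,
    "validation": 82,
    "size_adequacy": 91,
    "format_compliance": 95,
    "label_density": 93,
    "class_balance": 75,
}

def composite(scores: dict[str, int]) -> int:
    """Floor of the unweighted mean across all dimensions."""
    return sum(scores.values()) // len(scores)

def tier(lqs: int) -> str:
    """Map a composite score to its tier (thresholds from the FAQ)."""
    if lqs >= 90:
        return "platinum"
    if lqs >= 75:
        return "gold"
    if lqs >= 60:
        return "silver"
    return "bronze"

print(composite(SCORES), tier(composite(SCORES)))  # → 87 gold
```

Note the floor matters here: the plain mean of these seven scores is ≈87.7, so simple rounding would give 88, not the listed 87.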

What it's used for

Common tasks and benchmarks where MSR-VTT is the default or a competitive choice: video captioning, text-to-video retrieval, and video-language pretraining.

Sample statistics

What's actually in the dataset — from the maintainer's published stats.

10,000 video clips, 20 captions each (200,000 total), across 20 categories. Average clip length: 15 s. Captions written by Amazon Mechanical Turk (AMT) workers.
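The captions ship as JSON alongside the MP4 clips. The exact schema depends on the distribution you download, so the field names below (`sentences`, `video_id`, `caption`) are assumptions for illustration; the sketch shows how one might group the flat caption list back into per-clip caption sets (20 per clip in the full dataset).

```python
import json
from collections import defaultdict

# Toy annotation snippet mimicking MSR-VTT's caption JSON. The field
# names ("sentences", "video_id", "caption") are assumptions for
# illustration — verify them against the file you actually downloaded.
raw = json.loads("""
{
  "sentences": [
    {"video_id": "video0", "caption": "a man is singing on stage"},
    {"video_id": "video0", "caption": "a person performs a song"},
    {"video_id": "video1", "caption": "a cat chases a toy"}
  ]
}
""")

# Group the flat caption list by clip — the natural unit for
# captioning and retrieval training.
captions_by_clip: dict[str, list[str]] = defaultdict(list)
for sent in raw["sentences"]:
    captions_by_clip[sent["video_id"]].append(sent["caption"])

for vid, caps in sorted(captions_by_clip.items()):
    print(vid, len(caps))
```

On the real annotation file the same loop would yield 10,000 keys with 20 captions each.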

License

MSR-VTT — Microsoft Video-to-Text is distributed under Microsoft Research License (research use). This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify current license terms with the maintainer before commercial use.

Heads up: this dataset's license restricts commercial use. If you need multimodal data for production, check LabelSets' paid datasets below — every listing has an explicit commercial license.

Need commercial-licensed Multimodal data?

LabelSets sellers offer paid multimodal datasets with explicit commercial licensing, something public research datasets often can't provide.

Browse paid Multimodal → Sell your dataset

Similar public datasets

Other entries in the Multimodal catalog.

Frequently Asked Questions

Can I use MSR-VTT commercially?
MSR-VTT — Microsoft Video-to-Text is distributed under the Microsoft Research License (research use), which restricts commercial use. For a commercially licensed alternative in multimodal, see LabelSets' paid datasets.

How large is the dataset?
MSR-VTT contains 200,000 video-caption pairs: 10,000 video clips with 20 captions each, across 20 categories. Average clip length is 15 s, and the captions are human-written by AMT workers.

Who maintains MSR-VTT and where can I get it?
MSR-VTT is maintained by Microsoft Research and is available at https://cove.thecvf.com/datasets/839. LabelSets indexes and scores this dataset for discoverability but does not redistribute it.

What is LQS?
LQS is a 7-dimension quality score (completeness, uniqueness, validation, size adequacy, format compliance, label density, class balance) computed from the dataset's published statistics. Composite scores map to tiers: platinum (≥90), gold (≥75), silver (≥60), bronze (<60). Read the full methodology.