Curated Catalog · Multimodal

MSR-VTT — Microsoft Video-to-Text

10K web video clips with 200K human-written captions — the standard video captioning benchmark.

LQS 87 · gold · ⚠ Research-only · 200K video-caption pairs · 7 GB (MP4 · JSON) · Released 2016
Source: cove.thecvf.com · maintained by Microsoft Research

About this dataset

MSR-VTT (MSR Video to Text) from Microsoft Research is the standard benchmark for video captioning and retrieval. 10,000 web video clips covering 20 categories, each paired with 20 human-written English captions (200,000 captions total). Widely used for video-language pretraining, video captioning, and cross-modal retrieval research.

Maintainer: Microsoft Research
Formats: MP4 · JSON

LabelSets Quality Score

LQS is our 7-dimension quality score, computed from the dataset's published statistics. See methodology →

87 out of 100 · gold tier

High-quality dataset across most dimensions

Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.

Completeness 88
No public completeness metric; using prior for 'crowdsourced_qc' datasets.
Uniqueness 90
Benchmark-grade splits with leakage prevention.
Validation 82
Crowdsourced labels with quality-control protocol (redundancy, golden tests).
Size adequacy 91
200,000 pairs — exceeds 100,000 adequacy target for Multimodal.
Format compliance 95
Industry-standard format — drop-in compatible with mainstream tooling.
Label density 93
Average 20.0 labels per item (high density).
Class balance 75
Moderate class skew — realistic production distribution.
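The composite and tier mapping can be sketched in a few lines. The per-dimension weighting is not published on this page, so the sketch below assumes an unweighted mean floored to an integer (which happens to reproduce the listed score of 87 for this dataset's dimension values); the tier thresholds are the ones stated in the methodology FAQ.

```python
# Hypothetical sketch of the LQS composite and tier mapping.
# Assumption: an unweighted mean of the 7 dimension scores, floored to
# an integer. The real weighting may differ — see the methodology page.

SCORES = {
    "completeness": 88,
    "uniqueness": 90,
    "validation": 82,
    "size_adequacy": 91,
    "format_compliance": 95,
    "label_density": 93,
    "class_balance": 75,
}

def composite(scores: dict[str, int]) -> int:
    """Floor of the unweighted mean across all dimensions."""
    return sum(scores.values()) // len(scores)

def tier(lqs: int) -> str:
    """Map a composite score to its tier (thresholds from the FAQ)."""
    if lqs >= 90:
        return "platinum"
    if lqs >= 75:
        return "gold"
    if lqs >= 60:
        return "silver"
    return "bronze"

print(composite(SCORES), tier(composite(SCORES)))  # → 87 gold
```

Note the floor matters here: the plain mean of these seven scores is ≈87.7, so simple rounding would give 88, not the listed 87.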

What it's used for

Common tasks and benchmarks where MSR-VTT is the default or a competitive choice: video captioning, text-to-video retrieval, and video-language pretraining.

Sample statistics

What's actually in the dataset — from the maintainer's published stats.

10,000 video clips, 20 captions each (200,000 total), across 20 categories. Average clip length: 15 s. Captions written by Amazon Mechanical Turk (AMT) workers.
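The captions ship as JSON alongside the MP4 clips. The exact schema depends on the distribution you download, so the field names below (`sentences`, `video_id`, `caption`) are assumptions for illustration; the sketch shows how one might group the flat caption list back into per-clip caption sets (20 per clip in the full dataset).

```python
import json
from collections import defaultdict

# Toy annotation snippet mimicking MSR-VTT's caption JSON. The field
# names ("sentences", "video_id", "caption") are assumptions for
# illustration — verify them against the file you actually downloaded.
raw = json.loads("""
{
  "sentences": [
    {"video_id": "video0", "caption": "a man is singing on stage"},
    {"video_id": "video0", "caption": "a person performs a song"},
    {"video_id": "video1", "caption": "a cat chases a toy"}
  ]
}
""")

# Group the flat caption list by clip — the natural unit for
# captioning and retrieval training.
captions_by_clip: dict[str, list[str]] = defaultdict(list)
for sent in raw["sentences"]:
    captions_by_clip[sent["video_id"]].append(sent["caption"])

for vid, caps in sorted(captions_by_clip.items()):
    print(vid, len(caps))
```

On the real annotation file the same loop would yield 10,000 keys with 20 captions each.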

License

MSR-VTT — Microsoft Video-to-Text is distributed under Microsoft Research License (research use). This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify current license terms with the maintainer before commercial use.

Heads up: this dataset's license restricts commercial use. If you need multimodal data for production, check LabelSets' paid datasets below — every listing has an explicit commercial license.

Need commercial-licensed Multimodal data?

LabelSets sellers offer paid multimodal datasets with explicit commercial licensing, something public research datasets often can't provide.

Browse paid Multimodal → Sell your dataset

Similar public datasets

Other entries in the Multimodal catalog.

Frequently Asked Questions

Can I use MSR-VTT commercially?
MSR-VTT — Microsoft Video-to-Text is distributed under the Microsoft Research License (research use), which restricts commercial use. For a commercially licensed alternative in multimodal, see LabelSets' paid datasets.

How large is the dataset?
MSR-VTT contains 200,000 video-caption pairs: 10,000 video clips with 20 captions each, across 20 categories. Average clip length is 15 s, and the captions are human-written by AMT workers.

Who maintains MSR-VTT and where can I get it?
MSR-VTT is maintained by Microsoft Research and is available at https://cove.thecvf.com/datasets/839. LabelSets indexes and scores this dataset for discoverability but does not redistribute it.

What is LQS?
LQS is a 7-dimension quality score (completeness, uniqueness, validation, size adequacy, format compliance, label density, class balance) computed from the dataset's published statistics. Composite scores map to tiers: platinum (≥90), gold (≥75), silver (≥60), bronze (<60). Read the full methodology.