Curated Catalog

44 Real Public ML Datasets, LQS-Scored and Attributed

An independently curated index of public datasets used across ML research and production. Every entry has verified attribution, license info, and a LabelSets Quality Score (LQS). We don't host these datasets; each entry links to the original maintainer.

44 datasets indexed
7.2B+ labeled items
9 categories
26 commercial-use OK
🎙 Audio (5) · 🚗 Autonomous Vehicles (4) · 👁 Computer Vision (11) · 📄 Document / OCR (1) · 📈 Financial / Crypto (3) · ⚖️ Legal (3) · 🏥 Medical Imaging (5) · 🔀 Multimodal (4) · 💬 NLP / Text (8)
🎙 Audio

5 datasets
LibriSpeech
LQS 89
1,000 hours of read English audiobook speech — the standard ASR benchmark.
292K speech utterances · CC BY 4.0
Mozilla Common Voice
LQS 79
Crowdsourced multilingual voice dataset — 20K+ hours across 100+ languages.
14M validated clips · CC0 1.0
AudioSet
LQS 84
2.1M 10-second YouTube clips labeled across 527 audio event classes.
2.1M audio clips · CC BY 4.0 (labels)
GigaSpeech
LQS 85
10,000 hours of transcribed English speech from podcasts, audiobooks, and YouTube.
8.3M speech segments · Apache 2.0
VoxCeleb
LQS 82
1M+ speech utterances from 7K+ celebrities, extracted from YouTube videos; the standard speaker verification benchmark.
1.3M speech utterances · CC BY-SA 4.0 (VoxCeleb1) / Research (VoxCeleb2)
🚗 Autonomous Vehicles

4 datasets
nuScenes
LQS 94
Full AV sensor suite — 1.4M camera images + 390K LiDAR sweeps across 1,000 Boston/Singapore scenes.
1.4M camera images · CC BY-NC-SA 4.0
KITTI Vision Benchmark
LQS 86
The original self-driving benchmark — stereo, LiDAR, and 200K object labels from Karlsruhe.
200K labeled objects · CC BY-NC-SA 3.0
Waymo Open Dataset
LQS 92
High-resolution AV data from Waymo — 1,150 scenes, 12.6M LiDAR labels, 11.8M 2D box labels.
1.1K driving scenes · Waymo Dataset License (research only)
Cityscapes
LQS 84
25,000 urban street scene images with pixel-level semantic segmentation masks.
25K images · Cityscapes License (research use)
👁 Computer Vision

11 datasets
COCO — Common Objects in Context
LQS 89
Large-scale object detection, segmentation, and captioning dataset from Microsoft Research.
330K images · CC BY 4.0
ImageNet
LQS 83
14.2M hand-annotated images across 21K categories; the source of the ILSVRC benchmark that helped popularize deep learning for vision.
14.2M images · Custom (non-commercial research)
SA-1B — Segment Anything
LQS 89
11M licensed images with 1.1 billion segmentation masks from Meta AI.
1.1B segmentation masks · SA-1B Research License
Open Images V7
LQS 86
9M images with 36M image-level labels, 16M bounding boxes, and 2.7M segmentation masks.
9M images · CC BY 4.0 (annotations) / varies (images)
CIFAR-100
LQS 88
60,000 tiny 32×32 images across 100 balanced classes — a standard classification benchmark.
60K images · MIT-style (unrestricted research use)
Places365
LQS 83
10M scene images across 365 everyday place categories — from MIT CSAIL.
10M images · CC BY (research use)
Visual Genome
LQS 88
108K images with dense scene graph annotations — 5.4M region descriptions.
5.4M region descriptions · CC BY 4.0
ADE20K
LQS 86
Scene parsing benchmark with 25K images and pixel-level masks across 3,500+ object classes.
25.6K images · BSD-3-Clause
Pascal VOC 2012
LQS 83
Classic 20-class detection and segmentation benchmark — still a default for quick experiments.
11.5K images · Flickr Terms (permissive for research)
MNIST
LQS 83
70,000 handwritten digits — the canonical intro-ML benchmark.
70K images · Public Domain
Fashion-MNIST
LQS 83
Drop-in MNIST replacement with 70,000 fashion item images across 10 classes.
70K images · MIT
📄 Document / OCR

1 dataset
PubLayNet
LQS 86
360K document page images with layout annotations — the canonical doc parsing dataset.
360K document page images · CDLA-Permissive 1.0
📈 Financial / Crypto

3 datasets
SEC EDGAR Filings
LQS 86
Every US public company filing since 1993 — 20M+ documents, free and public domain.
20M filings · Public Domain (US Government work)
FinanceBench
LQS 87
10,231 expert-verified financial Q&A pairs across public company filings.
10.2K Q&A pairs · CC BY 4.0
Yahoo Finance Historical Data
LQS 89
Daily OHLCV price history for 6,000+ tickers going back to 1970.
80M daily OHLCV rows · Yahoo Terms (personal use + attribution)
🏥 Medical Imaging

5 datasets
MIMIC-IV
LQS 93
De-identified ICU and emergency department (ED) records for 315K patients from Beth Israel Deaconess Medical Center (BIDMC); requires credentialed access.
454K hospital admissions · PhysioNet Credentialed Health Data License
NIH ChestX-ray14
LQS 80
112,120 frontal chest X-rays from 30,805 patients with 14 disease labels.
112K X-ray images · CC0 1.0
CheXpert
LQS 79
224,316 chest X-rays from Stanford with automated + expert labels for 14 observations.
224K X-ray images · Stanford Research Use License
BraTS 2023 — Brain Tumor Segmentation
LQS 87
Multi-institutional glioma MRI challenge — 5,000+ annotated scans across 5 tumor segmentation tasks.
5K MRI studies · CC BY-NC 4.0
MedMNIST v2
LQS 89
708K medical images pre-processed to MNIST-like format across 12 datasets and 6 modalities.
708K medical images · CC BY 4.0
🔀 Multimodal

4 datasets
LAION-5B
LQS 76
5.85 billion CLIP-filtered image-text pairs; one of the largest openly available multimodal datasets.
5.85B image-text pairs · CC BY 4.0 (metadata only)
MSR-VTT — Microsoft Video-to-Text
LQS 87
10K web video clips with 200K human-written captions — the standard video captioning benchmark.
200K video-caption pairs · Microsoft Research License (research use)
HowTo100M
LQS 81
136M narrated video clips from 1.2M instructional YouTube videos.
136M video clips · Apache 2.0 (metadata)
Ego4D
LQS 88
3,670 hours of first-person video from 931 participants in 9 countries.
3.7K hours of egocentric video · Ego4D License (DUA required)
💬 NLP / Text

8 datasets
SQuAD 2.0 — Stanford Question Answering Dataset
LQS 85
150K crowdsourced question-answer pairs on Wikipedia passages, including unanswerable questions.
150K Q&A pairs · CC BY-SA 4.0
The Pile
LQS 80
825 GB diverse English text corpus for LLM pretraining, assembled from 22 high-quality sources.
300M documents · MIT (mixed sub-licenses)
C4 — Colossal Clean Crawled Corpus
LQS 80
365M cleaned web documents (156B tokens) — the pretraining corpus behind T5.
365M documents · ODC-BY
Wikipedia (English Dump)
LQS 83
6.8M+ English articles; one of the most widely used clean text corpora for pretraining and retrieval.
6.8M articles · CC BY-SA 4.0
MS MARCO
LQS 84
8.8M passages and 1M anonymized Bing queries — the go-to benchmark for passage ranking.
8.8M passages · MS MARCO License (research only)
GLUE Benchmark
LQS 88
9 NLU tasks bundled as the industry-standard fine-tuning benchmark.
1.1M labeled examples · Various per-task (mostly CC BY / MIT)
Common Crawl
LQS 69
250+ billion web pages (petabytes) — the raw material behind most LLM pretraining.
250B web pages · Terms of use (redistribute with attribution)
OSCAR
LQS 80
Multilingual web corpus spanning 166 languages, extracted from Common Crawl.
431M documents · CC0 1.0

Looking for production-ready datasets?

LabelSets' paid marketplace offers LQS-verified datasets with explicit commercial licenses, instant download, and no research-only restrictions, priced from $29 up to flagship tier.


About the catalog

These are third-party public datasets maintained by universities, research labs, companies, and consortia. LabelSets indexes them for discoverability and applies our 7-dimension LabelSets Quality Score (LQS) so you can compare quality across sources.

We do not redistribute these datasets. Each entry links to the original maintainer; download and licensing happen on their platform. Our job is to make these easier to find, compare, and evaluate.

If you maintain a public dataset and want it added or corrected, email us. If you need commercially licensed data for production ML, browse our paid marketplace.