Curated Catalog

44 Real Public ML Datasets, LQS-Scored and Attributed

An independently curated index of public datasets used across ML research and production. Every entry has verified attribution, license information, and a LabelSets Quality Score (LQS). We don't host these datasets; each entry links to the original maintainer.


44 datasets indexed · 7.2B+ labeled items · 9 categories · 26 commercial-use OK

🎙 Audio (5) · 🚗 Autonomous Vehicles (4) · 👁 Computer Vision (11) · 📄 Document / OCR (1) · 📈 Financial / Crypto (3) · ⚖️ Legal (3) · 🏥 Medical Imaging (5) · 🔀 Multimodal (4) · 💬 NLP / Text (8)

🎙 Audio

5 datasets

LibriSpeech

1,000 hours of read English audiobook speech, the standard ASR benchmark.

292K speech utterances · CC BY 4.0

Mozilla Common Voice

Crowdsourced multilingual voice dataset — 20K+ hours across 100+ languages.

14M validated clips · CC0 1.0

AudioSet

2.1M 10-second YouTube clips labeled across 527 audio event classes.

2.1M audio clips · CC BY 4.0 (labels)

GigaSpeech

10,000 hours of transcribed English speech from podcasts, audiobooks, and YouTube.

8.3M speech segments · Apache 2.0

VoxCeleb

1M+ speech utterances from 7K+ celebrities, extracted from YouTube videos; the standard speaker verification benchmark.

1.3M speech utterances · CC BY-SA 4.0 (VoxCeleb1) / Research (VoxCeleb2)

🚗 Autonomous Vehicles

4 datasets

nuScenes

Full AV sensor suite: 1.4M camera images and 390K LiDAR sweeps across 1,000 Boston/Singapore scenes.

1.4M camera images · CC BY-NC-SA 4.0

KITTI Vision Benchmark

The original self-driving benchmark — stereo, LiDAR, and 200K object labels from Karlsruhe.

200K labeled objects · CC BY-NC-SA 3.0

Waymo Open Dataset

High-resolution AV data from Waymo — 1,150 scenes, 12.6M LiDAR labels, 11.8M 2D box labels.

1.1K driving scenes · Waymo Dataset License (research only)

Cityscapes

25,000 urban street scene images with pixel-level semantic segmentation masks.

25K images · Cityscapes License (research use)

👁 Computer Vision

11 datasets

COCO — Common Objects in Context

Large-scale object detection, segmentation, and captioning dataset from Microsoft Research.

330K images · CC BY 4.0

ImageNet

14.2M hand-annotated images across 21K categories, the dataset that launched deep learning.

14.2M images · Custom (non-commercial research)

SA-1B — Segment Anything

11M licensed images with 1.1 billion segmentation masks from Meta AI.

1.1B segmentation masks · SA-1B Research License

Open Images

9M images with 36M image-level labels, 16M bounding boxes, and 2.7M segmentation masks.

9M images · CC BY 4.0 (annotations) / varies (images)

CIFAR-100

60,000 tiny 32×32 images across 100 balanced classes, a standard classification benchmark.

60K images · MIT-style (unrestricted research use)

Places365

10M scene images across 365 everyday place categories, from MIT CSAIL.

10M images · CC BY (research use)

Visual Genome

108K images with dense scene graph annotations, including 5.4M region descriptions.

5.4M region descriptions · CC BY 4.0

ADE20K

Scene parsing benchmark with 25K images and pixel-level masks across 3,500+ object classes.

25.6K images · BSD-3-Clause

Pascal VOC 2012

Classic 20-class detection and segmentation benchmark — still a default for quick experiments.

11.5K images · Flickr Terms (permissive for research)

MNIST

70,000 handwritten digits, the canonical intro-ML benchmark.

70K images · Public Domain

Fashion-MNIST

Drop-in MNIST replacement with 70,000 fashion item images across 10 classes.

70K images · MIT

📄 Document / OCR

1 dataset

PubLayNet

360K document page images with layout annotations, the canonical doc parsing dataset.

360K document page images · CDLA-Permissive 1.0

📈 Financial / Crypto

3 datasets

SEC EDGAR Filings

Every US public company filing since 1993 — 20M+ documents, free and public domain.

20M filings · Public Domain (US Government work)

FinanceBench

10,231 expert-verified financial Q&A pairs across public company filings.

10.2K Q&A pairs · CC BY 4.0

Yahoo Finance Historical Data

Daily OHLCV price history for 6,000+ tickers going back to 1970.

80M daily OHLCV rows · Yahoo Terms (personal use + attribution)

⚖️ Legal

3 datasets

CUAD — Contract Understanding Atticus Dataset

510 commercial contracts with 13,101 expert-labeled clauses across 41 legal categories.

13.1K clause annotations · CC BY 4.0

Pile of Law

256 GB of legal text: court opinions, contracts, statutes, and regulatory filings.

10M legal documents · CC BY-NC-SA 4.0

Caselaw Access Project

6.7M US court decisions spanning 360 years — fully digitized by Harvard Law.

6.7M court cases · CC0 (post-2024 release)

🏥 Medical Imaging

5 datasets

MIMIC-IV

Deidentified ICU and ED records for 315K patients from BIDMC; credentialed access required.

454K hospital admissions · PhysioNet Credentialed Health Data License

NIH ChestX-ray14

112,120 frontal chest X-rays from 30,805 patients with 14 disease labels.

112K X-ray images · CC0 1.0

CheXpert

224,316 chest X-rays from Stanford with automated and expert labels for 14 observations.

224K X-ray images · Stanford Research Use License

BraTS 2023 — Brain Tumor Segmentation

Multi-institutional glioma MRI challenge — 5,000+ annotated scans across 5 tumor segmentation tasks.

5K MRI studies · CC BY-NC 4.0

MedMNIST

708K medical images pre-processed to MNIST-like format across 12 datasets and 6 modalities.

708K medical images · CC BY 4.0

🔀 Multimodal

4 datasets

LAION-5B

5.85 billion CLIP-filtered image-text pairs, the largest open multimodal dataset.

5.8B image-text pairs · CC BY 4.0 (metadata only)

MSR-VTT — Microsoft Video-to-Text

10K web video clips with 200K human-written captions — the standard video captioning benchmark.

200K video-caption pairs · Microsoft Research License (research use)

HowTo100M

136M narrated video clips from 1.2M instructional YouTube videos.

136M video clips · Apache 2.0 (metadata)

Ego4D

3,670 hours of first-person video from 931 participants in 9 countries.

3.7K hours of egocentric video · Ego4D License (DUA required)

💬 NLP / Text

8 datasets

SQuAD 2.0 — Stanford Question Answering Dataset

150K crowdsourced question-answer pairs on Wikipedia passages, including unanswerable questions.

150K Q&A pairs · CC BY-SA 4.0

The Pile

825 GB diverse English text corpus for LLM pretraining, assembled from 22 high-quality sources.

300M documents · MIT (mixed sub-licenses)

C4 — Colossal Clean Crawled Corpus

365M cleaned web documents (156B tokens) — the pretraining corpus behind T5.

365M documents · ODC-BY

Wikipedia (English Dump)

6.8M+ English articles — the most-used clean text corpus for pretraining and retrieval.

6.8M articles · CC BY-SA 4.0

MS MARCO

8.8M passages and 1M anonymized Bing queries, the go-to benchmark for passage ranking.

8.8M passages · MS MARCO License (research only)

GLUE

9 NLU tasks bundled as the industry-standard fine-tuning benchmark.

1.1M labeled examples · Various per-task (mostly CC BY / MIT)

Common Crawl

250+ billion web pages (petabytes of data), the raw material behind most LLM pretraining.

250B web pages · Terms of use (redistribute with attribution)

OSCAR

Multilingual web corpus spanning 166 languages, extracted from Common Crawl.

431M documents · CC0 1.0

Looking for production-ready datasets?

LabelSets' paid marketplace offers LQS-verified datasets with explicit commercial licenses, instant download, and no research-only restrictions, priced from $29 up to flagship tier.


About the catalog

These are third-party public datasets maintained by universities, research labs, companies, and consortia. LabelSets indexes them for discoverability and applies our 7-dimension LabelSets Quality Score (LQS) so you can compare quality across sources.

We do not redistribute these datasets. Each entry links to the original maintainer; download and licensing happen on their platform. Our job is to make these easier to find, compare, and evaluate.

If you maintain a public dataset and want it added (or corrected), email us. If you need commercially licensed data for production ML, browse our paid marketplace.