
LAION-5B

5.85 billion CLIP-filtered image-text pairs — the largest open multimodal dataset.

LQS 76 · gold | ✓ Commercial OK | 5.8B image-text pairs | 240 GB | Parquet · Metadata | Released 2022
Source: laion.ai · maintained by LAION (Large-scale Artificial Intelligence Open Network)
Image-text pairs: 5.8B
Size on disk: 240 GB
LQS: 76 · gold
First released: 2022

About this dataset

LAION-5B is the largest openly accessible image-text dataset, comprising 5.85 billion image-text pairs scraped from Common Crawl and CLIP-filtered for quality. It powered the training of Stable Diffusion and many other open-source multimodal models. The dataset is split by language: 2.32B English pairs, 2.26B multilingual, and 1.27B with unknown language. Note: LAION provides URLs and metadata, not the images themselves.

Formats
Parquet · Metadata

LabelSets Quality Score

LQS is our 7-dimension quality score, computed from the dataset's published statistics. See methodology →

76 out of 100 · gold tier

Solid dataset with some trade-offs

Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.

Completeness 72
No public completeness metric; using prior for 'web_scrape' datasets.
Uniqueness 78
CLIP-embedding similarity filtering — some near-dupes possible at embedding level.
Validation 75
Pairs filtered automatically by a trained model (CLIP similarity), not human-verified.
Size adequacy 100
5,850,000,000 pairs — exceeds 100,000 adequacy target for Multimodal.
Format compliance 88
Custom format with published schema documentation.
Label density 52
Average 1.0 labels per item (sparse).
Class balance 58
Long-tail distribution — dominant classes overrepresented.
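The composite above can be sketched as a weighted mean over the seven dimensions. A minimal sketch, assuming equal weights (the per-dimension weights are not published on this page, so equal weighting lands near, but not exactly at, the published 76); the tier cutoffs follow the stated methodology (platinum ≥90, gold ≥75, silver ≥60, bronze <60):

```python
# Published LAION-5B dimension scores from this page.
dims = {
    "completeness": 72,
    "uniqueness": 78,
    "validation": 75,
    "size_adequacy": 100,
    "format_compliance": 88,
    "label_density": 52,
    "class_balance": 58,
}

def composite(scores, weights=None):
    """Weighted mean of dimension scores; equal weights by default
    (an assumption -- the real LQS weighting is not published here)."""
    weights = weights or {k: 1.0 for k in scores}
    total = sum(weights.values())
    return sum(scores[k] * weights[k] for k in scores) / total

def tier(score):
    """Map a composite score to its LQS tier per the stated cutoffs."""
    if score >= 90:
        return "platinum"
    if score >= 75:
        return "gold"
    if score >= 60:
        return "silver"
    return "bronze"

score = composite(dims)
print(f"equal-weight composite: {score:.1f} -> {tier(score)}")
print(f"published composite: 76 -> {tier(76)}")
```

Equal weights give roughly 74.7, just below the gold cutoff; the published 76 implies the real score uses non-uniform weights.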

What it's used for

Common tasks and benchmarks where LAION-5B is the default or competitive choice.

Sample statistics

What's actually in the dataset — from the maintainer's published stats.

5.85B total pairs: 2.32B English, 2.26B multilingual, 1.27B with unknown language. Mean CLIP image-text similarity ~0.28. The release contains only URLs and captions; images are fetched on demand.
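Because the release is metadata-only, a typical first step is to load a metadata shard and re-filter on the stored CLIP similarity before fetching any images. A minimal sketch with pandas; the column names (`URL`, `TEXT`, `similarity`) and the toy rows are assumptions standing in for a real parquet shard:

```python
import pandas as pd

# Toy rows standing in for one metadata shard; in practice you would
# load it with pd.read_parquet(...). Column names are an assumption.
shard = pd.DataFrame({
    "URL": ["https://example.com/a.jpg", "https://example.com/b.jpg"],
    "TEXT": ["a photo of a red bicycle", "b"],
    "similarity": [0.31, 0.22],
})

# Keep only pairs at or above the dataset's mean CLIP similarity (~0.28);
# images are fetched later from the surviving URLs.
keep = shard[shard["similarity"] >= 0.28]
print(len(keep), "of", len(shard), "pairs kept")  # 1 of 2 pairs kept
```

At full scale the surviving URLs are usually handed to a dedicated bulk downloader rather than fetched one by one in a loop.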

License

LAION-5B is distributed under CC BY 4.0 (metadata only). This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify current license terms with the maintainer before commercial use.

Need commercial-licensed Multimodal data?

LabelSets sellers offer paid multimodal datasets with guarantees that public datasets often can't provide, such as clear commercial licensing and maintainer support.

Browse paid Multimodal → Sell your dataset

Similar public datasets

Other entries in the Multimodal catalog.

Frequently Asked Questions

Is LAION-5B available for commercial use?
LAION-5B is distributed under CC BY 4.0 (metadata only), which generally permits commercial use. Always verify the current license terms with the maintainer, LAION (Large-scale Artificial Intelligence Open Network), before using it in a commercial product.

How many samples does LAION-5B contain?
LAION-5B contains 5,850,000,000 image-text pairs: 2.32B English, 2.26B multilingual, and 1.27B with unknown language. Mean CLIP similarity is ~0.28. The release contains only URLs and captions; images are fetched on demand.

Who maintains LAION-5B?
LAION-5B is maintained by LAION (Large-scale Artificial Intelligence Open Network) and is available at https://laion.ai/blog/laion-5b/. LabelSets indexes and scores this dataset for discoverability but does not redistribute it.

What is LQS?
LQS is a 7-dimension quality score (completeness, uniqueness, validation, size adequacy, format compliance, label density, class balance) computed from the dataset's published statistics. Composite scores map to tiers: platinum (≥90), gold (≥75), silver (≥60), bronze (<60). Read the full methodology.