
PubLayNet

360K document page images with layout annotations, widely used as a benchmark for document layout analysis.

LQS 86 · gold · ✓ Commercial OK · 360K document page images · 96 GB · PDF, JSON · Released 2019


Source: developer.ibm.com · maintained by IBM Research

About this dataset

PubLayNet, from IBM Research, is one of the largest publicly available datasets for document layout analysis. It contains about 360K page images from PubMed Central open-access articles, with automatically generated layout annotations for five block types: text, titles, lists, tables, and figures. It is a standard benchmark for document understanding models such as LayoutLM.

Maintainer

IBM Research

License

CDLA-Permissive 1.0

Formats

PDF · JSON

Paper

Available on arxiv.org

LabelSets Quality Score

LQS is our 7-dimension quality score, computed from the dataset's published statistics.

86 out of 100 · gold tier

High-quality dataset across most dimensions

Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.

Completeness 92

Published by maintainer: 92% completeness across annotated fields.

Uniqueness 93

Exact-hash deduplication documented by maintainer.

Validation 75

Labels were generated automatically, not annotated by hand.

Size adequacy 95

360,000 images, well above the 10,000-image adequacy target for Document / OCR.

Format compliance 95

Industry-standard format — drop-in compatible with mainstream tooling.

Label density 71

Average 5.0 labels per item (high density).

Class balance 75

Moderate class skew — realistic production distribution.
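
As a rough sanity check on the composite, the seven dimension scores listed above can be averaged. This is a minimal sketch that assumes equal weights; the page does not publish the actual LQS weighting, so `unweighted_mean` is a hypothetical helper and not the LQS formula. The plain average comes out slightly below the published 86, so the real composite presumably weights or rounds the dimensions differently.

```python
# Dimension scores exactly as listed on this page.
DIMENSION_SCORES = {
    "completeness": 92,
    "uniqueness": 93,
    "validation": 75,
    "size_adequacy": 95,
    "format_compliance": 95,
    "label_density": 71,
    "class_balance": 75,
}


def unweighted_mean(scores):
    """Plain average of the dimension scores.

    Assumption: equal weights. The real LQS weighting is not published here.
    """
    return sum(scores.values()) / len(scores)


mean_score = unweighted_mean(DIMENSION_SCORES)
print(f"Unweighted mean: {mean_score:.1f}")  # prints "Unweighted mean: 85.1"
```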

What it's used for

Common tasks and benchmarks where PubLayNet is the default or competitive choice.

Document layout analysis
Table detection
Figure extraction
Reading order

Sample statistics

What's actually in the dataset — from the maintainer's published stats.

Roughly 335K training, 11K development, and 11K test page images. Five block types: text, title, list, table, figure. Sourced from the PubMed Central Open Access subset.
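
The block-type breakdown can be checked against the annotation JSON itself. The sketch below assumes the annotations follow the COCO object-detection layout (top-level `images`, `annotations`, and `categories` lists), which is how PubLayNet's JSON is commonly described; verify this against the files you download. The function names, the file path in the comment, and the sample records are all hypothetical and are not real PubLayNet data.

```python
import json
from collections import Counter


def load_annotations(path):
    """Load a COCO-style annotation file from disk (path is caller-supplied)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_blocks_by_type(coco):
    """Count annotations per category name in a COCO-style dict."""
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    return Counter(id_to_name[a["category_id"]] for a in coco["annotations"])


# Tiny invented sample for illustration only; NOT real PubLayNet data.
sample = {
    "images": [{"id": 1, "file_name": "page_0001.jpg", "width": 612, "height": 792}],
    "categories": [
        {"id": 1, "name": "text"},
        {"id": 2, "name": "title"},
        {"id": 3, "name": "list"},
        {"id": 4, "name": "table"},
        {"id": 5, "name": "figure"},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 2, "bbox": [50, 40, 500, 30]},
        {"id": 11, "image_id": 1, "category_id": 1, "bbox": [50, 80, 500, 200]},
        {"id": 12, "image_id": 1, "category_id": 1, "bbox": [50, 300, 500, 150]},
        {"id": 13, "image_id": 1, "category_id": 4, "bbox": [50, 470, 500, 250]},
    ],
}

counts = count_blocks_by_type(sample)
for name, n in counts.most_common():
    print(f"{name}: {n}")

# For a real file (hypothetical path):
# counts = count_blocks_by_type(load_annotations("publaynet/train.json"))
```

Running the same count over the real training annotations is a quick way to see the class skew noted under Class balance before training a detector.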

License

PubLayNet is distributed under CDLA-Permissive 1.0. This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify current license terms with the maintainer before commercial use.

Frequently Asked Questions

PubLayNet is distributed under CDLA-Permissive 1.0, which generally permits commercial use. Always verify the current license terms with the maintainer (IBM Research) before using in a commercial product.

PubLayNet contains about 360,000 document page images, split into roughly 335K training, 11K development, and 11K test pages. Annotations cover five block types: text, title, list, table, and figure. All pages are sourced from the PubMed Central Open Access subset.

PubLayNet is maintained by IBM Research and is available at https://developer.ibm.com/exchanges/data/all/publaynet/. LabelSets indexes and scores this dataset for discoverability but does not redistribute it.

LQS is a 7-dimension quality score (completeness, uniqueness, validation, size adequacy, format compliance, label density, class balance) computed from the dataset's published statistics. Composite scores map to tiers: platinum (≥90), gold (≥75), silver (≥60), bronze (<60).
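
The tier thresholds stated above translate directly into a lookup. This is a minimal sketch; `lqs_tier` is a hypothetical helper name, and only the thresholds themselves come from this page.

```python
def lqs_tier(score):
    """Map a composite LQS score (0-100) to its tier name."""
    if score >= 90:
        return "platinum"
    if score >= 75:
        return "gold"
    if score >= 60:
        return "silver"
    return "bronze"


print(lqs_tier(86))  # prints "gold", matching PubLayNet's listed tier
```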
