📄 Curated Catalog · Document / OCR

PubLayNet

360K document page images with layout annotations, widely treated as the canonical dataset for document parsing.

LQS 86 (gold) | ✓ Commercial OK | 360K document page images | 96 GB on disk | PDF, JSON | Released 2019
Source: developer.ibm.com · maintained by IBM Research

About this dataset

PubLayNet from IBM Research is one of the largest publicly available datasets for document layout analysis. It contains 360K page images from PubMed Central open access articles, with automatically generated layout annotations for text blocks, titles, lists, tables, and figures. It is a standard benchmark for document understanding models such as LayoutLM.

Maintainer: IBM Research
Formats: PDF · JSON

LabelSets Quality Score

LQS is our 7-dimension quality score, computed from the dataset's published statistics.

86 out of 100 · gold tier

High-quality dataset across most dimensions

Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.

Completeness 92
Published by maintainer: 92% completeness across annotated fields.
Uniqueness 93
Exact-hash deduplication documented by maintainer.
Validation 75
Labels were generated automatically rather than annotated by hand.
Size adequacy 95
360,000 images — exceeds 10,000 adequacy target for Document / OCR.
Format compliance 95
Industry-standard format — drop-in compatible with mainstream tooling.
Label density 71
Average 5.0 labels per item (high density).
Class balance 75
Moderate class skew — realistic production distribution.

What it's used for

Common tasks and benchmarks where PubLayNet is the default or competitive choice.

Sample statistics

What's actually in the dataset — from the maintainer's published stats.

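The dataset is split into 335K training, 11K development, and 11K test page images. Each layout block is labeled as one of 5 types: text, title, list, table, or figure. All pages are sourced from the PubMed Central Open Access subset.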
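
To check figures like these against a local copy, the annotation JSON can be tallied per block type. The sketch below assumes the annotations follow a COCO-style layout (top-level `images`, `categories`, and `annotations` keys); that layout, the helper names, and the toy sample are illustrative assumptions, not something this page specifies.

```python
from collections import Counter


def count_blocks_per_type(coco: dict) -> dict:
    """Return {category name: number of annotations} for a COCO-style dict."""
    id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}
    counts = Counter(id_to_name[ann["category_id"]] for ann in coco["annotations"])
    # Include categories with zero annotations so every block type is listed.
    return {name: counts.get(name, 0) for name in id_to_name.values()}


def labels_per_image(coco: dict) -> float:
    """Average number of annotations per page image."""
    return len(coco["annotations"]) / len(coco["images"])


# Hand-made toy sample (assumed structure), NOT real PubLayNet data.
sample = {
    "images": [{"id": 1, "file_name": "page_0001.jpg"}],
    "categories": [
        {"id": 1, "name": "text"},
        {"id": 2, "name": "title"},
        {"id": 3, "name": "list"},
        {"id": 4, "name": "table"},
        {"id": 5, "name": "figure"},
    ],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 2, "bbox": [50, 40, 400, 30]},
        {"id": 2, "image_id": 1, "category_id": 1, "bbox": [50, 80, 400, 120]},
        {"id": 3, "image_id": 1, "category_id": 1, "bbox": [50, 210, 400, 90]},
        {"id": 4, "image_id": 1, "category_id": 4, "bbox": [50, 310, 400, 150]},
    ],
}

print(count_blocks_per_type(sample))
print(labels_per_image(sample))
```

For a real annotation file, the same functions could be applied to the result of `json.load` on that file. The full training annotations are likely large, so expect noticeable load time and memory use.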

License

PubLayNet is distributed under CDLA-Permissive 1.0. This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify current license terms with the maintainer before commercial use.



Frequently Asked Questions

Can I use PubLayNet commercially?
PubLayNet is distributed under CDLA-Permissive 1.0, which generally permits commercial use. Always verify the current license terms with the maintainer (IBM Research) before using it in a commercial product.

How large is PubLayNet?
PubLayNet contains 360,000 document page images, split into 335K training, 11K development, and 11K test images. Layout blocks are labeled as one of 5 types: text, title, list, table, or figure. All pages are sourced from the PubMed Central Open Access subset.

Where can I get PubLayNet?
PubLayNet is maintained by IBM Research and is available at https://developer.ibm.com/exchanges/data/all/publaynet/. LabelSets indexes and scores this dataset for discoverability but does not redistribute it.

What is the LabelSets Quality Score (LQS)?
LQS is a 7-dimension quality score (completeness, uniqueness, validation, size adequacy, format compliance, label density, class balance) computed from the dataset's published statistics. Composite scores map to tiers: platinum (≥90), gold (≥75), silver (≥60), bronze (<60).
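
The tier thresholds above can be written as a small lookup. The sketch below also computes a plain unweighted average of the seven dimension scores listed on this page. That average is an assumption made for illustration only: it comes out near 85.1, not the published 86, so the actual composite presumably weights or rounds the dimensions differently. The names `lqs_tier` and `dimensions` are hypothetical.

```python
def lqs_tier(score: float) -> str:
    """Map a composite score to its tier using the thresholds stated on this page."""
    if score >= 90:
        return "platinum"
    if score >= 75:
        return "gold"
    if score >= 60:
        return "silver"
    return "bronze"


# Per-dimension scores as listed on this page for PubLayNet.
dimensions = {
    "completeness": 92,
    "uniqueness": 93,
    "validation": 75,
    "size_adequacy": 95,
    "format_compliance": 95,
    "label_density": 71,
    "class_balance": 75,
}

# ASSUMPTION: equal weights. The real weighting is not given here.
unweighted = sum(dimensions.values()) / len(dimensions)

print(round(unweighted, 1), lqs_tier(unweighted))  # 85.1 gold
print(lqs_tier(86))  # gold
```

Both the unweighted average and the published composite fall in the gold band, so the tier does not depend on the missing weights in this case.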