💬 Curated Catalog · NLP / Text

GLUE Benchmark

9 NLU tasks bundled as the industry-standard fine-tuning benchmark.

LQS 88 · gold ✓ · Commercial OK · 1.1M labeled examples · 700 MB · TSV · JSON · Released 2018
Source: gluebenchmark.com · maintained by NYU / University of Washington / DeepMind

About this dataset

GLUE (General Language Understanding Evaluation), from NYU, the University of Washington, and DeepMind, bundles 9 natural language understanding tasks — SST-2, MNLI, QNLI, QQP, RTE, CoLA, STS-B, MRPC, WNLI — into a single benchmark for fine-tuning and evaluating pretrained language models. SuperGLUE has superseded it at the top of the leaderboards, but GLUE scores are still widely reported.
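The headline GLUE number reported on leaderboards is, in essence, a macro-average of the per-task scores (with MNLI's matched/mismatched dev sets averaged first). A minimal sketch of that aggregation; the per-task scores below are made-up placeholders, not real results:

```python
def glue_macro_average(task_scores):
    """Unweighted mean of per-task scores, as used for the headline GLUE number."""
    return sum(task_scores.values()) / len(task_scores)

# Hypothetical per-task scores, for illustration only.
example = {"sst2": 94.0, "mnli": 86.5, "qnli": 92.0, "qqp": 89.0,
           "rte": 70.0, "cola": 60.0, "stsb": 88.0, "mrpc": 85.0, "wnli": 65.0}
overall = glue_macro_average(example)
```

Note that some tasks report accuracy while others report correlation (STS-B) or Matthews correlation (CoLA); the leaderboard averages them on a common 0-100 scale.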

Formats
TSV · JSON
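The TSV files are plain tab-separated tables and need no special tooling. A small sketch of parsing one with the standard library; the column names (`sentence`, `label`) follow SST-2's layout and are an assumption — other tasks use different columns:

```python
import csv
import io

# Inline stand-in for an SST-2-style train.tsv (column names assumed).
sample = "sentence\tlabel\na fine film\t1\ndull and slow\t0\n"

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
```

In practice you would pass an open file handle instead of the `io.StringIO` wrapper used here for self-containment.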

LabelSets Quality Score

LQS is our 7-dimension quality score, computed from the dataset's published statistics. See methodology →

88 out of 100 · gold tier

High-quality dataset across most dimensions

Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.

Completeness 95
No public completeness metric; using prior for 'expert_curated' datasets.
Uniqueness 93
Exact-hash deduplication documented by maintainer.
Validation 95
Multiple expert annotators with reconciliation pass.
Size adequacy 93
1,100,000 items — exceeds 100,000 adequacy target for NLP / Text.
Format compliance 95
Industry-standard format — drop-in compatible with mainstream tooling.
Label density 52
Average 1.0 labels per item (sparse).
Class balance 75
Moderate class skew — realistic production distribution.

What it's used for

Common tasks and benchmarks where GLUE Benchmark is the default or competitive choice.

Sample statistics

What's actually in the dataset — from the maintainer's published stats.

9 tasks: SST-2 (67K sentiment), MNLI (393K NLI), QNLI (105K), QQP (364K paraphrases), RTE, CoLA, STS-B, MRPC, WNLI. Total ~1.1M labeled examples.

License

GLUE Benchmark is distributed under various per-task licenses (mostly CC BY / MIT). This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify current license terms with the maintainer before commercial use.

Need commercial-licensed NLP / Text data?

LabelSets sellers offer paid NLP / Text datasets with guarantees that public datasets often can't provide.

Browse paid NLP / Text → Sell your dataset

Similar public datasets

Other entries in the NLP / Text catalog.

Frequently Asked Questions

Can GLUE Benchmark be used commercially?
GLUE Benchmark is distributed under various per-task licenses (mostly CC BY / MIT), which generally permit commercial use. Always verify the current license terms with the maintainer (NYU / University of Washington / DeepMind) before using it in a commercial product.
How many labeled examples does GLUE Benchmark contain?
GLUE Benchmark contains roughly 1.1 million labeled examples across 9 tasks: SST-2 (67K sentiment), MNLI (393K NLI), QNLI (105K), QQP (364K paraphrases), RTE, CoLA, STS-B, MRPC, and WNLI.
Who maintains GLUE Benchmark, and where can I get it?
GLUE Benchmark is maintained by NYU / University of Washington / DeepMind and is available at https://gluebenchmark.com/tasks. LabelSets indexes and scores this dataset for discoverability but does not redistribute it.
What is LQS?
LQS is a 7-dimension quality score (completeness, uniqueness, validation, size adequacy, format compliance, label density, class balance) computed from the dataset's published statistics. Composite scores map to tiers: platinum (≥90), gold (≥75), silver (≥60), bronze (<60). Read the full methodology.