Free Dataset Quality Audit

How it works

From upload to report in 60 seconds

Upload a sample

Drop in any CSV, JSON, JSONL, or ZIP file. We only need a representative sample of up to 20 MB.

We analyze it instantly

Our pipeline evaluates 14 dimensions across 5 tiers (structural integrity, annotation quality, statistical health, training fitness, and provenance) and includes real ML model runs.

Report in your inbox

A full LQS breakdown with actionable tips to improve your score before listing.

List and earn

Platinum-tier datasets sell faster and command premium prices, earning 3× more revenue on average.

LQS v2.0 — Published Methodology

14 dimensions across 5 tiers

Tier 1 — Structural Integrity · 35%

✅

Completeness

Null/missing value rates per column; orphaned record detection.

12% of score
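
As a rough illustration of the per-column null-rate check described above, the sketch below counts missing cells in a CSV sample. The function name `null_rates` and the set of tokens treated as missing are assumptions made for this example, not the audit pipeline's actual rules, and orphaned-record detection is not covered.

```python
import csv
import io


def null_rates(csv_text, missing=("", "NA", "null", "None")):
    """Return {column: fraction of rows whose value counts as missing}.

    The `missing` tokens are an assumption for this sketch.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in (reader.fieldnames or [])}
    total = 0
    for row in reader:
        total += 1
        for name in counts:
            value = row.get(name)
            # A short row yields None for the columns it lacks.
            if value is None or value.strip() in missing:
                counts[name] += 1
    return {name: (counts[name] / total if total else 0.0) for name in counts}


sample = "id,label,text\n1,cat,hello\n2,,world\n3,dog,\n4,NA,hi\n"
rates = null_rates(sample)
print(rates)
```

Here `label` is missing in two of four rows and `text` in one of four, so a column-level rate makes it easy to see which fields pull the score down.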

🔁

Uniqueness

Exact duplicate detection via hash comparison across all rows.

8% of score
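
Exact duplicate detection by hash comparison might look something like the following. This is a minimal sketch: the choice of SHA-256, the field separator, and the name `duplicate_rate` are assumptions for illustration and may differ from what the audit actually does.

```python
import hashlib


def duplicate_rate(rows):
    """Fraction of rows that exactly repeat an earlier row.

    Each row is a sequence of strings. Fields are joined with the ASCII
    unit separator so ("ab", "c") and ("a", "bc") hash differently.
    """
    seen = set()
    duplicates = 0
    for row in rows:
        digest = hashlib.sha256("\x1f".join(row).encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates += 1
        else:
            seen.add(digest)
    return duplicates / len(rows) if rows else 0.0


rows = [("1", "cat"), ("2", "dog"), ("1", "cat"), ("3", "cat")]
rate = duplicate_rate(rows)
print(rate)
```

Storing digests rather than full rows keeps memory use bounded by the number of distinct rows, which matters for wide records.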

🗂

Schema Validity

Type violations, range errors, and row-level schema drift detection.

8% of score

📄

Format Integrity

Spec compliance — YOLO/COCO/VOC/CSV/JSONL encoding and structure.

7% of score

Tier 2 — Annotation Quality · 30%

🎯

Label Accuracy

Malformed annotation rate — invalid coords, missing fields, parse failures.

10% of score

🏷

Label Density

Annotations per image / response length / label coverage per sample.

8% of score

📐

Annotation Consistency

Coefficient of variation of annotation density — catches mixed labeling standards.

7% of score
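
The coefficient of variation is the standard deviation divided by the mean. A small sketch of how it could be computed over per-sample annotation counts follows; the function name and the use of the population standard deviation are assumptions, and how the result maps to a score is not specified here.

```python
import statistics


def coefficient_of_variation(counts):
    """Population standard deviation divided by the mean (0.0 if undefined)."""
    if not counts:
        return 0.0
    mean = statistics.fmean(counts)
    if mean == 0:
        return 0.0
    return statistics.pstdev(counts) / mean


uniform = coefficient_of_variation([4, 4, 4, 4])   # every image labeled alike
mixed = coefficient_of_variation([2, 2, 10, 10])   # two labeling standards
print(uniform, mixed)
```

A value near zero suggests annotators applied one standard throughout, while a large value hints that sparsely and densely labeled batches were merged.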

⚖

Class Distribution

Shannon entropy normalized to [0,1]; imbalance ratio; rare-class coverage.

5% of score
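
For reference, normalized Shannon entropy divides the entropy of the class frequencies by its maximum, log(k) for k classes, so a perfectly balanced dataset scores 1. The sketch below also reports the imbalance ratio as the largest class count over the smallest; the function name and the handling of the single-class case are assumptions, and rare-class coverage is left out.

```python
import math
from collections import Counter


def class_balance(labels):
    """Return (normalized_entropy, imbalance_ratio) for a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    k = len(counts)
    if k < 2:
        # With zero or one class, entropy carries no information.
        return 0.0, 1.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    normalized = entropy / math.log(k)
    ratio = max(counts.values()) / min(counts.values())
    return normalized, ratio


balanced = class_balance(["a", "a", "b", "b"])
skewed = class_balance(["a", "a", "a", "b"])
print(balanced, skewed)
```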

Tier 3 — Statistical Health · 20%

📈

Distribution Health

Bbox area spread, text length variance, null-rate distribution across features.

8% of score

🔇

Label Error Estimate

Composite estimate of label error rate from invalid annotations, duplicates, and missing labels.

7% of score

⚡

Signal Strength

Class separability proxy — class count × entropy × vocabulary diversity.

5% of score

Tier 4 — Training Fitness · 10%

📊

Size Adequacy

Sample count checked against task-specific minimums, ranging from 200 (audio) to 10K (tabular).

5% of score

🌐

Diversity Score

Type-token vocabulary ratio; class spread; semantic variety across samples.

5% of score
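
The type-token ratio is the number of distinct tokens divided by the total number of tokens. The sketch below uses lowercasing and whitespace splitting as a stand-in tokenizer, which is an assumption for this example; class spread and semantic variety are not modeled.

```python
def type_token_ratio(texts):
    """Distinct tokens / total tokens over all texts (0.0 if there are none)."""
    tokens = [tok for text in texts for tok in text.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


ttr = type_token_ratio(["the cat sat", "the dog sat"])
print(ttr)
```

The ratio shrinks as a corpus grows, so it is most meaningful when compared across samples of similar size.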

Tier 5 — Provenance · 5%

🔍

Provenance Quality

Description length, tag coverage, license type, and data source documentation.

5% of score

90–100 · Platinum

75–89 · Gold

60–74 · Silver

0–59 · Bronze

Composite = weighted average of all 14 dimensions. Scores are computed from live file analysis, never self-reported. Full methodology →
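
To make the weighting concrete, here is a minimal sketch of the composite, using the percentages listed for each dimension as weights and the score bands given above. It assumes every dimension is scored on a 0–100 scale and treats each band's lower bound as a threshold; the identifier names are hypothetical.

```python
# Weights are the "% of score" figures listed for each dimension.
WEIGHTS = {
    "completeness": 12, "uniqueness": 8, "schema_validity": 8,
    "format_integrity": 7,                       # Tier 1: 35%
    "label_accuracy": 10, "label_density": 8,
    "annotation_consistency": 7, "class_distribution": 5,  # Tier 2: 30%
    "distribution_health": 8, "label_error_estimate": 7,
    "signal_strength": 5,                        # Tier 3: 20%
    "size_adequacy": 5, "diversity_score": 5,    # Tier 4: 10%
    "provenance_quality": 5,                     # Tier 5: 5%
}


def composite(scores):
    """Weighted average of per-dimension scores (each assumed 0-100)."""
    return sum(WEIGHTS[n] * scores[n] for n in WEIGHTS) / sum(WEIGHTS.values())


def band(score):
    """Map a composite score to its band, using lower bounds as thresholds."""
    if score >= 90:
        return "Platinum"
    if score >= 75:
        return "Gold"
    if score >= 60:
        return "Silver"
    return "Bronze"


scores = {name: 90 for name in WEIGHTS}
scores["completeness"] = 40  # one weak dimension
result = composite(scores)
print(result, band(result))
```

Because Completeness carries 12% of the weight, dropping it from 90 to 40 lowers the composite by 6 points, enough to move this example from Platinum to Gold.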

Get a free quality score for your dataset

Ready to start selling?
