Empirically calibrated weights. Task-conditional scoring. Adversarial robustness profiling. Evidence triangulation with calibrated confidence. Published methodology with full correlation data.
LQS v3.0 is not a checklist. It's a quality intelligence engine that predicts how well your model will perform if you train on this data — conditioned on your specific task, stress-tested for robustness, and backed by three independent sources of evidence.
The trust mark is awarded only when all three evidence sources are available, confidence level is HIGH, and the composite score is ≥ 60. Like UL Listed for electronics or USDA Organic for food — it means the dataset passed a verification pipeline that predicts downstream performance, not just structural checks.
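The gating logic above is simple to state precisely. A minimal sketch of the check (function name and the string-valued inputs are illustrative, not the published implementation):

```python
def awards_trust_mark(evidence_sources, confidence, composite_score):
    """All three evidence sources present, HIGH confidence, composite >= 60."""
    required = {"statistical", "empirical", "adversarial"}
    return (required <= set(evidence_sources)
            and confidence == "HIGH"
            and composite_score >= 60)
```

Note the conjunction: a dataset with a 95 composite but only two evidence sources does not earn the mark.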
We trained 3-model ensemble baselines (XGBoost + Random Forest + Logistic Regression for tabular data; TF-IDF Logistic Regression for NLP) on 49 anchor datasets across 7 task types, measured held-out macro F1 via 5-fold cross-validation, and computed the Spearman rank correlation between each LQS dimension and downstream performance. Three of the 14 dimensions are statistically significant (p < 0.05). The weights update automatically as more datasets flow through the system.
Calibration suite: 49 anchor datasets across 7 task types (multiclass classification, binary classification, text classification, legal AI, medical AI, financial AI, regression). Baseline models: a 3-model ensemble (XGBoost + Random Forest + Logistic Regression) for tabular data; TF-IDF Logistic Regression for NLP (DistilBERT is attempted first, with fallback on timeout). 5-fold stratified cross-validation with a fixed seed for reproducibility. The system recalibrates daily via a dual-signal feedback loop: buyer-reported training outcomes (the gold standard) and validation benchmark F1 from every dataset processed through the marketplace. Raw data is published at calibration/results/.
We systematically flip 1%, 5%, 10%, and 20% of labels to random incorrect values, retrain the baseline model each time, and measure how much F1 degrades. The resulting degradation curve tells you whether the dataset is robust enough for production — before you spend compute discovering it isn't.
| Dataset | Domain | Clean F1 | Robustness | ΔF1 @1% noise | ΔF1 @5% | ΔF1 @10% | ΔF1 @20% |
|---|---|---|---|---|---|---|---|
| Medical Triage | Medical AI | 99.8% | 83/100 | 0.4% | 5.4% | 8.4% | 19.6% |
| Mushroom | Multiclass | 98.5% | 81/100 | 1.5% | 5.1% | 10.9% | 20.9% |
| Wine Recognition | Multiclass | 97.4% | 84/100 | 0.0% | 3.1% | 11.6% | 17.9% |
| Breast Cancer | Binary | 95.8% | 80/100 | 1.4% | 5.1% | 12.2% | 20.5% |
| Iris | Multiclass | 94.9% | 81/100 | 2.8% | 2.9% | 10.2% | 22.7% |
| Spambase | Binary | 93.7% | 80/100 | 1.2% | 5.1% | 11.4% | 22.5% |
| SMS Spam | Text Cls. | 91.4% | 72/100 | 2.0% | 9.0% | 16.0% | 29.7% |
| Ionosphere | Binary | 90.4% | 82/100 | 1.7% | 6.7% | 9.3% | 17.9% |
| AG News | Text Cls. | 88.8% | 81/100 | 1.1% | 5.2% | 10.3% | 20.5% |
| Financial Routing | Financial AI | 88.1% | 80/100 | 1.3% | 5.0% | 10.5% | 21.0% |
| Clinical Reasoning | Medical AI | 86.2% | 76/100 | 1.4% | 9.3% | 15.3% | 22.7% |
| Titanic | Binary | 80.5% | 84/100 | 1.0% | 4.0% | 9.4% | 17.4% |
| Legal Contracts | Legal AI | 59.3% | 84/100 | 0.3% | 2.9% | 6.3% | 23.0% |
| Legal Multi-Jurisdiction | Legal AI | 34.3% | 69/100 | 5.3% | 15.0% | 8.4% | 34.1% |
| Wine Quality | Multiclass | 34.0% | 74/100 | 7.6% | 6.9% | 13.0% | 25.0% |
Robustness tiers: Robust (≥85) — degrades gracefully, production-ready. Moderate (65–84) — some noise sensitivity, label auditing recommended. Fragile (<65) — high sensitivity, requires quality control before training. Tested across 49 datasets with 3-model ensemble baselines and 4 noise injection rates.
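The perturbation procedure behind the table can be sketched in a few lines. This is an illustrative outline, not the production pipeline: `train_and_eval` stands in for whatever baseline fit-and-score routine you use (e.g. the 3-model ensemble returning macro F1):

```python
import random

def flip_labels(labels, rate, classes, seed=0):
    """Flip `rate` of labels to a uniformly chosen *incorrect* class."""
    rng = random.Random(seed)
    flipped = list(labels)
    n_flip = int(round(rate * len(labels)))
    for i in rng.sample(range(len(labels)), n_flip):
        wrong = [c for c in classes if c != flipped[i]]
        flipped[i] = rng.choice(wrong)
    return flipped

def degradation_curve(train_and_eval, X, y, rates=(0.01, 0.05, 0.10, 0.20)):
    """F1 drop at each noise rate, relative to training on clean labels."""
    classes = sorted(set(y))
    clean_f1 = train_and_eval(X, y)
    return {r: clean_f1 - train_and_eval(X, flip_labels(y, r, classes))
            for r in rates}
```

A dataset whose curve stays shallow (e.g. ~1% ΔF1 at 1% noise, ~20% at 20%) degrades roughly linearly; a curve that collapses at 5% noise is the fragile case the tiers flag.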
A single quality number is misleading. Class balance is critical for fraud detection but largely irrelevant for NER. LQS v3.0 reweights its 14 dimensions based on your intended task, producing a score that predicts performance for YOUR specific use case.
How it works: Profiles marked with dataset counts are empirically derived — the multipliers come from per-task Spearman correlations computed against the calibration suite. "Expert-tuned" profiles use domain-knowledge multipliers and will be upgraded to empirical once enough datasets of that type flow through the marketplace. All multipliers are renormalized to sum to 1.0. Task profiles auto-detect from dataset category and format. The system recalibrates task multipliers daily as new datasets are validated.
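The renormalization step is worth making concrete: multipliers scale the base weights per dimension, then the result is rescaled so the weights again sum to 1.0. A minimal sketch (dimension names are illustrative):

```python
def apply_task_profile(base_weights, multipliers):
    """Scale each dimension weight by its task multiplier, renormalize to 1.0."""
    raw = {d: base_weights[d] * multipliers.get(d, 1.0) for d in base_weights}
    total = sum(raw.values())
    return {d: w / total for d, w in raw.items()}
```

For example, a fraud-detection profile that triples the weight on class balance shifts relative importance without changing the score's 0-100 scale: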
A quality score backed by one methodology is a claim. A score backed by three independent methodologies that agree is evidence. LQS v3.0 requires convergence across statistical analysis, real model training, and adversarial stress testing before awarding HIGH confidence.
| Source | What it measures | How it's computed | What disagreement means |
|---|---|---|---|
| Statistical | Data properties: completeness, balance, consistency, schema | 14-dimension LQS score from direct file analysis | If statistical says high but empirical says low → data looks clean but doesn't train well (possible spurious correlations) |
| Empirical | Actual trainability: does a model learn from this data? | Real baseline model (Naive Bayes) trained on an 80% split, F1 measured on the held-out 20% | If empirical says high but statistical says low → data is messy but trainable (hidden signal in noise) |
| Adversarial | Robustness: how fragile is the training signal? | Label-flip perturbation at 1%/5%/10%/20%, degradation curve | If statistical and empirical agree but adversarial says fragile → data trains well but may fail in production under noise |
The v3.0 composite blends all three evidence sources. Statistical scoring provides the base. Empirical training adjusts ±30%. Adversarial robustness adjusts ±15%. The final number is a prediction of downstream performance, not a checklist result.
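One way to read "base ±30% ±15%" is a clamped multiplicative blend. The sketch below is an assumption about the blend's shape, not the published formula; the exact combination rule lives in the methodology repo:

```python
def composite_score(statistical, empirical_adj, adversarial_adj):
    """Statistical base, empirical adjustment in [-0.30, +0.30],
    adversarial adjustment in [-0.15, +0.15], result clamped to 0-100."""
    empirical_adj = max(-0.30, min(0.30, empirical_adj))
    adversarial_adj = max(-0.15, min(0.15, adversarial_adj))
    score = statistical * (1 + empirical_adj) * (1 + adversarial_adj)
    return max(0.0, min(100.0, score))
```

Under this reading, a clean-looking dataset (statistical 70) that trains poorly (empirical −30%) lands near 49, while one that also proves robust can rise above its statistical base.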
LQS v3.0 has format-specific analysis pipelines for every major ML data format. Each pipeline extracts the signals most meaningful for that format type.
| Format | Type | Key signals |
|---|---|---|
| YOLO TXT | Vision / detection | annotations/image, coord validity, class frequency, bbox area distribution, density CV |
| COCO JSON | Vision / detection | image-annotation linking, category coverage, bbox validation |
| Pascal VOC XML | Vision / detection | XML parse rate, bbox validity, class names, density CV |
| KITTI TXT | Vision / 3D detection | 15-field format, 3D bbox parameters |
| LabelMe JSON | Vision / segmentation | polygon validity, class coverage |
| Image Folder | Vision / classification | class count (folder names), per-class sample count, balance |
| CSV / TSV | Tabular / NLP | null rate, schema consistency, label distribution, text length, vocab diversity |
| JSONL | Fine-tuning / NLP | parse errors, field coverage, response length, instruction diversity |
| Parquet | Tabular | columnar encoding, null rate, duplicate rate, schema |
| Arrow / Feather | Tabular | schema validation, null rate, type compliance |
| SQLite | Tabular | table structure, row count, null rate |
| HDF5 | Scientific / arrays | dataset shape, dtype validation, fill values |
| Version | Date | Changes |
|---|---|---|
| v3.0 | 2026-04-16 | Quality Intelligence Engine. Empirically calibrated weights from 49-dataset study with 3-model ensemble baselines (XGBoost + RF + LogReg). Three statistically significant dimensions (p<0.05). 7 empirical task profiles (binary, multiclass, text classification, legal AI, medical AI, financial AI, regression). Self-learning feedback loop: daily recalibration from buyer outcomes + validation benchmarks. Embedding-space analysis, multi-model agreement estimation, adversarial robustness profiling. 23 analysis modules. "LQS Verified" trust mark. |
| v2.0 | 2026-04-13 | 14 dimensions across 5 pillars. ML model runs for trainability scoring. Tier system (Platinum / Gold / Silver / Bronze). Legal domain augmentation (3 legal-specific dimensions). |
| v1.0 | 2026-04-07 | Initial 7-dimension system covering structural and annotation fundamentals. |
Each dataset record stores the LQS version used to compute its scores. Datasets scored under earlier versions retain their original scores and are re-scored to v3.0 on next re-validation.
Get your dataset scored against LQS v3.0
Free quality audit → No account required · Results in 60 seconds