LQS v3.0 · Quality Intelligence Engine

The first dataset quality system
backed by real training data.

Empirically calibrated weights. Task-conditional scoring. Adversarial robustness profiling. Evidence triangulation with calibrated confidence. Published methodology with full correlation data.

14 dimensions · Empirically calibrated · Task-conditional · Adversarial stress-tested · 49-dataset calibration study
What makes LQS v3 different

Three innovations no other quality system has

LQS v3.0 is not a checklist. It's a quality intelligence engine that predicts how well your model will perform if you train on this data — conditioned on your specific task, stress-tested for robustness, and backed by three independent sources of evidence.

Innovation 1
Task-Conditional Quality Prediction
The score changes based on what you plan to do. A medical imaging dataset scores 92 for radiology classification but 61 for pathology segmentation. Same data, different score — because the system knows what matters for each task. 10 task profiles, auto-detected from metadata.
Innovation 2
Adversarial Robustness Profile
Instead of "this is 85/100," you get a degradation curve. At 5% label noise, how much does model performance drop? At 20%? This tells you whether the dataset is robust or fragile — before you spend compute discovering it the hard way.
Innovation 3
Evidence Triangulation
Every score comes from three independent sources: statistical analysis, real model training, and adversarial stress testing. When all three agree → high confidence. When they disagree → that disagreement is published as the most valuable signal.
LQS Verified by LabelSets

The trust mark is awarded only when all three evidence sources are available, confidence level is HIGH, and the composite score is ≥ 60. Like UL Listed for electronics or USDA Organic for food — it means the dataset passed a verification pipeline that predicts downstream performance, not just structural checks.

Empirical Evidence

Calibration study: per-dimension correlation with model performance

We trained 3-model ensemble baselines (XGBoost + Random Forest + Logistic Regression for tabular; TF-IDF LogReg for NLP) on 49 anchor datasets across 7 task types, measured held-out macro F1 via 5-fold cross-validation, and computed Spearman rank correlation between each LQS dimension and downstream performance. Three dimensions are statistically significant (p<0.05). The weights update automatically as more datasets flow through the system.
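The rank-correlation step can be sketched in pure Python (the dimension scores and F1 values below are illustrative, not the published calibration data; a production pipeline would typically call `scipy.stats.spearmanr` instead):

```python
def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.
    No tie handling -- fine for a sketch, not for production."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# One score per anchor dataset for one dimension, plus the matching macro F1.
annotation_consistency = [0.91, 0.72, 0.85, 0.60, 0.95]
macro_f1               = [0.93, 0.61, 0.88, 0.52, 0.96]
rho = spearman(annotation_consistency, macro_f1)   # ranks agree -> rho = 1.0
```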

Why this matters: Every other dataset quality system picks weights by intuition. We derived ours from data — and the system recalibrates itself daily. Dimensions that don't correlate with downstream F1 get downweighted to 1%. Dimensions that do correlate get the weight the data says they deserve. As more datasets are validated and more buyers submit training outcomes, the correlations tighten and the scores get smarter.
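The downweight-to-1% rule described above can be sketched as follows. The 1% floor comes from the text; the specific rho-to-weight mapping (weight proportional to non-negative rho) is an assumption for illustration:

```python
FLOOR = 0.01  # 1% floor for dimensions that don't correlate with F1

def calibrate_weights(rhos):
    """Weight each dimension in proportion to its (non-negative) Spearman
    rho against downstream F1, flooring non-correlating dimensions at ~1%.
    Illustrative mapping -- only the floor behavior is documented."""
    raw = {d: max(r, 0.0) for d, r in rhos.items()}      # negative rho -> 0
    total = sum(raw.values()) or 1.0
    w = {d: max(v / total, FLOOR) for d, v in raw.items()}
    norm = sum(w.values())
    return {d: v / norm for d, v in w.items()}            # renormalized to 1.0

weights = calibrate_weights({
    "annotation_consistency": 0.512,   # strongest predictor (table above)
    "label_error_estimate":   0.30,    # illustrative value
    "completeness":          -0.10,    # negative correlation -> floored
})
```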
Live Calibration
-- datasets -- categories -- buyer outcomes
Connecting to calibration engine

Per-dimension Spearman ρ vs. downstream macro F1

Key finding: Three dimensions now carry 82.3% of the total weight — all backed by statistically significant p-values. Annotation consistency (ρ=+0.512, p=0.005) is the strongest single predictor of downstream model performance. Previous "intuition weights" were wrong: completeness was weighted at 12% but has negative correlation with F1 when measured with ensemble baselines; label error estimate was weighted at 7% but is the third strongest predictor. These weights recalibrate automatically as more datasets flow through the system.

Calibration suite: 49 anchor datasets across 7 task types (multiclass classification, binary classification, text classification, legal AI, medical AI, financial AI, regression). Baseline models: 3-model ensemble (XGBoost + Random Forest + Logistic Regression) for tabular; TF-IDF Logistic Regression for NLP (DistilBERT is attempted first and falls back to TF-IDF on timeout). 5-fold stratified cross-validation with a fixed seed for reproducibility. The system recalibrates daily via a dual-signal feedback loop: buyer-reported training outcomes (the gold standard) and validation benchmark F1 from every dataset processed through the marketplace. Raw data is published at calibration/results/.
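The 5-fold stratified split can be sketched in pure Python (a real pipeline would typically use scikit-learn's `StratifiedKFold`; this only shows the splitting logic, not the ensemble training):

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=42):
    """Yield (train_idx, test_idx) pairs whose folds preserve per-class
    proportions: each class is shuffled, then dealt round-robin to folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    for t in range(k):
        test = sorted(folds[t])
        train = sorted(i for f in range(k) if f != t for i in folds[f])
        yield train, test

labels = ["pos"] * 10 + ["neg"] * 5
splits = list(stratified_kfold(labels, k=5))
# Every test fold holds 2 "pos" and 1 "neg" -- the data's own 2:1 ratio.
```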

Adversarial Robustness Profile (ARP)

How fragile is the data under noise?

We systematically flip 1%, 5%, 10%, and 20% of labels to random incorrect values, retrain the baseline model each time, and measure how much F1 degrades. The resulting degradation curve tells you whether the dataset is robust enough for production — before you spend compute discovering it isn't.
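The noise-injection loop above can be sketched as below; `train_and_f1` is an assumed placeholder standing in for the real baseline training run:

```python
import random

def flip_labels(labels, rate, classes, seed=0):
    """Flip `rate` of the labels to a random *incorrect* class,
    as in the ARP noise-injection step (sketch, not the real pipeline)."""
    rng = random.Random(seed)
    flipped = list(labels)
    for i in rng.sample(range(len(labels)), int(len(labels) * rate)):
        flipped[i] = rng.choice([c for c in classes if c != flipped[i]])
    return flipped

def degradation_curve(X, y, classes, train_and_f1,
                      rates=(0.01, 0.05, 0.10, 0.20)):
    """F1 drop at each noise rate, relative to training on clean labels."""
    clean_f1 = train_and_f1(X, y)
    return {r: clean_f1 - train_and_f1(X, flip_labels(y, r, classes, seed=1))
            for r in rates}
```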

Dataset | Domain | Clean F1 | Robustness | @1% noise | @5% | @10% | @20%
Medical Triage | Medical AI | 99.8% | 83/100 | 0.4% | 5.4% | 8.4% | 19.6%
Mushroom | Multiclass | 98.5% | 81/100 | 1.5% | 5.1% | 10.9% | 20.9%
Wine Recognition | Multiclass | 97.4% | 84/100 | 0.0% | 3.1% | 11.6% | 17.9%
Breast Cancer | Binary | 95.8% | 80/100 | 1.4% | 5.1% | 12.2% | 20.5%
Iris | Multiclass | 94.9% | 81/100 | 2.8% | 2.9% | 10.2% | 22.7%
Spambase | Binary | 93.7% | 80/100 | 1.2% | 5.1% | 11.4% | 22.5%
SMS Spam | Text Cls. | 91.4% | 72/100 | 2.0% | 9.0% | 16.0% | 29.7%
Ionosphere | Binary | 90.4% | 82/100 | 1.7% | 6.7% | 9.3% | 17.9%
AG News | Text Cls. | 88.8% | 81/100 | 1.1% | 5.2% | 10.3% | 20.5%
Financial Routing | Financial AI | 88.1% | 80/100 | 1.3% | 5.0% | 10.5% | 21.0%
Clinical Reasoning | Medical AI | 86.2% | 76/100 | 1.4% | 9.3% | 15.3% | 22.7%
Titanic | Binary | 80.5% | 84/100 | 1.0% | 4.0% | 9.4% | 17.4%
Legal Contracts | Legal AI | 59.3% | 84/100 | 0.3% | 2.9% | 6.3% | 23.0%
Legal Multi-Jurisdiction | Legal AI | 34.3% | 69/100 | 5.3% | 15.0% | 8.4% | 34.1%
Wine Quality | Multiclass | 34.0% | 74/100 | 7.6% | 6.9% | 13.0% | 25.0%
What the ARP reveals (49 datasets tested): Medical Triage achieves near-perfect F1 (99.8%) but degrades 19.6% at 20% noise — template-based data is easy to learn but noise-sensitive. The Multi-Jurisdiction Legal corpus degrades 34.1% at 20% noise — expected for fine-grained multi-class tasks. Titanic is remarkably robust (only 17.4% at 20%) — strong binary signal tolerates noise well. Every dataset in the marketplace gets this profile so buyers know fragility before they buy.

Robustness tiers: Robust (≥85) — degrades gracefully, production-ready. Moderate (65–84) — some noise sensitivity, label auditing recommended. Fragile (<65) — high sensitivity, requires quality control before training. Tested across 49 datasets with 3-model ensemble baselines and 4 noise injection rates.
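The tier thresholds above map directly to code; a minimal sketch:

```python
def robustness_tier(score: float) -> str:
    """ARP tier from a 0-100 robustness score (thresholds as published)."""
    if score >= 85:
        return "Robust"      # degrades gracefully, production-ready
    if score >= 65:
        return "Moderate"    # some noise sensitivity, audit labels
    return "Fragile"         # high sensitivity, QC before training

tier = robustness_tier(83)   # e.g. Medical Triage at 83/100 -> "Moderate"
```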

Task-Conditional Quality Prediction (TCQP)

Same dataset, different score — because tasks have different needs

A single quality number is misleading. Class balance, for example, is critical for fraud detection but nearly irrelevant for NER. LQS v3.0 reweights the 14 dimensions based on your intended task, producing a score that predicts performance for your specific use case.

Binary Classification 13 datasets
signal_strength ×3.0 · distribution_health ×3.0 · size_adequacy ×2.1 · annotation_consistency ×1.5 · class_distribution ×1.4
Multi-class Classification 17 datasets
completeness ×3.0 · uniqueness ×3.0 · schema_validity ×3.0 · signal_strength ×1.7
Text Classification 3 datasets
completeness ×3.0 · label_accuracy ×3.0 · schema_validity ×3.0 · distribution_health ×3.0 · diversity_score ×3.0
Legal AI 3 datasets
annotation_consistency ×2.0 · label_error_estimate ×1.8 · completeness ×3.0 · schema_validity ×3.0
Medical AI 3 datasets
annotation_consistency ×3.0 · label_accuracy ×2.0 · completeness ×1.8
Financial AI 3 datasets
class_distribution ×3.0 · signal_strength ×2.0 · uniqueness ×1.5
Regression 3 datasets
distribution_health ×3.0 · signal_strength ×3.0 · size_adequacy ×3.0 · label_error_estimate ×1.9 · class_distribution ×1.8
LLM Fine-Tuning expert-tuned
diversity_score ×2.5 · annotation_consistency ×2.0 · uniqueness ×2.0
Named Entity Recognition expert-tuned
label_density ×2.5 · annotation_consistency ×2.5 · completeness ×1.5
Object Detection expert-tuned
label_density ×2.5 · annotation_consistency ×2.0 · class_distribution ×1.5

How it works: Profiles marked with dataset counts are empirically derived — the multipliers come from per-task Spearman correlations computed against the calibration suite. "Expert-tuned" profiles use domain-knowledge multipliers and will be upgraded to empirical once enough datasets of that type flow through the marketplace. All multipliers are renormalized to sum to 1.0. Task profiles auto-detect from dataset category and format. The system recalibrates task multipliers daily as new datasets are validated.
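The multiply-then-renormalize rule can be sketched as below. The dimension names, base weights, scores, and profile multipliers are illustrative, not the production values:

```python
def task_conditional_score(dim_scores, base_weights, multipliers):
    """Apply a task profile's multipliers to the calibrated base weights,
    renormalize to sum to 1.0, and compute the weighted score."""
    w = {d: base_weights[d] * multipliers.get(d, 1.0) for d in base_weights}
    total = sum(w.values())
    w = {d: v / total for d, v in w.items()}          # renormalize
    return sum(dim_scores[d] * w[d] for d in w)

dims    = {"signal_strength": 90, "completeness": 60, "label_density": 40}
base    = {"signal_strength": 0.5, "completeness": 0.3, "label_density": 0.2}
binary  = {"signal_strength": 3.0}    # binary-classification-style profile
ner     = {"label_density": 2.5}      # NER-style profile

binary_score = task_conditional_score(dims, base, binary)   # ~80.5
ner_score    = task_conditional_score(dims, base, ner)      # ~63.8
# Same dataset, different scores under different task profiles.
```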

Evidence Triangulation

Three sources must agree before we call it "verified"

A quality score backed by one methodology is a claim. A score backed by three independent methodologies that agree is evidence. LQS v3.0 requires convergence across statistical analysis, real model training, and adversarial stress testing before awarding HIGH confidence.

Source | What it measures | How it's computed | What disagreement means
Statistical | Data properties: completeness, balance, consistency, schema | 14-dimension LQS score from direct file analysis | Statistical high but empirical low → data looks clean but doesn't train well (possible spurious correlations)
Empirical | Actual trainability: does a model learn from this data? | Real baseline model (Naive Bayes) trained on an 80% split; F1 measured on the held-out 20% | Empirical high but statistical low → data is messy but trainable (hidden signal in noise)
Adversarial | Robustness: how fragile is the training signal? | Label-flip perturbation at 1% / 5% / 10% / 20%; degradation curve | Statistical and empirical agree but adversarial says fragile → data trains well but may fail in production under noise
Confidence levels: HIGH = all three sources available and spread ≤ 20 points. MEDIUM = two sources or spread 20–40 points. LOW = one source only. The "LQS Verified" trust mark requires HIGH confidence — no shortcuts.
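The confidence rules map to a small function. One case is not specified in the text (all three sources available but spread above 40 points); the sketch below assumes it falls to LOW:

```python
def confidence(sources):
    """Confidence level from available evidence sources (values 0-100,
    None = unavailable) and their spread in points."""
    scores = [s for s in sources.values() if s is not None]
    if len(scores) <= 1:
        return "LOW"                      # one source only
    spread = max(scores) - min(scores)
    if len(scores) == 3 and spread <= 20:
        return "HIGH"                     # all three available and agree
    if spread <= 40:
        return "MEDIUM"                   # two sources, or spread 20-40
    return "LOW"                          # assumed: wide disagreement

level = confidence({"statistical": 82, "empirical": 75, "adversarial": 70})
```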
Formula

Composite score

The v3.0 composite blends all three evidence sources. Statistical scoring provides the base. Empirical training adjusts ±30%. Adversarial robustness adjusts ±15%. The final number is a prediction of downstream performance, not a checklist result.

// LQS v3.0 composite formula
blended     = statistical_composite × 0.70
            + empirical_f1 × 0.30          // if empirical evidence available
final_score = blended × 0.85
            + robustness × 0.15            // if ARP available

// statistical_composite = Σ(dim_score × calibrated_weight)
// calibrated_weight derived from Spearman ρ (see table above)
// empirical_f1 from real baseline model training
// robustness from adversarial label-flip degradation
// if a source is unavailable, its blending step is skipped
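Read as code, the composite becomes the following sketch; missing evidence simply leaves the corresponding blending step out:

```python
def composite(statistical, empirical_f1=None, robustness=None):
    """LQS v3.0 composite: statistical base, blended 70/30 with empirical
    F1 when available, then blended 85/15 with robustness when ARP ran."""
    score = statistical
    if empirical_f1 is not None:
        score = statistical * 0.70 + empirical_f1 * 0.30
    if robustness is not None:
        score = score * 0.85 + robustness * 0.15
    return score

s = composite(80, empirical_f1=90, robustness=70)   # ~81.05
```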
90–100 Platinum
75–89 Gold
60–74 Silver
0–59 Bronze
Format Support

12 format types

LQS v3.0 has format-specific analysis pipelines for every major ML data format. Each pipeline extracts the signals most meaningful for that format type.

Format | Type | Key signals
YOLO TXT | Vision / detection | annotations/image, coord validity, class frequency, bbox area distribution, density CV
COCO JSON | Vision / detection | image-annotation linking, category coverage, bbox validation
Pascal VOC XML | Vision / detection | XML parse rate, bbox validity, class names, density CV
KITTI TXT | Vision / 3D detection | 15-field format, 3D bbox parameters
LabelMe JSON | Vision / segmentation | polygon validity, class coverage
Image Folder | Vision / classification | class count (folder names), per-class sample count, balance
CSV / TSV | Tabular / NLP | null rate, schema consistency, label distribution, text length, vocab diversity
JSONL | Fine-tuning / NLP | parse errors, field coverage, response length, instruction diversity
Parquet | Tabular | columnar encoding, null rate, duplicate rate, schema
Arrow / Feather | Tabular | schema validation, null rate, type compliance
SQLite | Tabular | table structure, row count, null rate
HDF5 | Scientific / arrays | dataset shape, dtype validation, fill values
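A minimal sketch of the structural signals the CSV/TSV pipeline extracts (null rate, duplicate rate, schema consistency). The signal names mirror the table above; the implementation is illustrative, not the production pipeline:

```python
import csv
import io

def csv_signals(text: str) -> dict:
    """Compute basic CSV quality signals: fraction of empty cells,
    fraction of duplicate rows, and column-count consistency vs header."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    cells = [c for r in body for c in r]
    nulls = sum(1 for c in cells if c.strip() == "")
    dupes = len(body) - len({tuple(r) for r in body})
    return {
        "null_rate": nulls / max(len(cells), 1),
        "duplicate_rate": dupes / max(len(body), 1),
        "schema_consistent": all(len(r) == len(header) for r in body),
    }

sig = csv_signals("a,b\n1,2\n1,2\n3,\n")
# One empty cell out of six, one duplicate row out of three.
```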
Versioning

Version history

Version | Date | Changes
v3.0 | 2026-04-16 | Quality Intelligence Engine. Empirically calibrated weights from 49-dataset study with 3-model ensemble baselines (XGBoost + RF + LogReg). Three statistically significant dimensions (p<0.05). 7 empirical task profiles (binary, multiclass, text classification, legal AI, medical AI, financial AI, regression). Self-learning feedback loop: daily recalibration from buyer outcomes + validation benchmarks. Embedding-space analysis, multi-model agreement estimation, adversarial robustness profiling. 23 analysis modules. "LQS Verified" trust mark.
v2.0 | 2026-04-13 | 14 dimensions across 5 pillars. ML model runs for trainability scoring. Tier system (Platinum / Gold / Silver / Bronze). Legal domain augmentation (3 legal-specific dimensions).
v1.0 | 2026-04-07 | Initial 7-dimension system covering structural and annotation fundamentals.

Each dataset record stores the LQS version used to compute its scores. Datasets scored under earlier versions retain their original scores and are re-scored to v3.0 on next re-validation.

Get your dataset scored against LQS v3.0

Free quality audit →

No account required · Results in 60 seconds