LQS v3.0 · Quality Intelligence Engine

The first dataset quality system
backed by real training data.

Empirically calibrated weights. Task-conditional scoring. Adversarial robustness profiling. Evidence triangulation with calibrated confidence. Published methodology with full correlation data.

14 dimensions · Empirically calibrated · Task-conditional · Adversarial stress-tested · 49-dataset calibration study
What makes LQS v3 different

Three innovations no other quality system has

LQS v3.0 is not a checklist. It's a quality intelligence engine that predicts how well your model will perform if you train on this data — conditioned on your specific task, stress-tested for robustness, and backed by three independent sources of evidence.

Innovation 1
Task-Conditional Quality Prediction
The score changes based on what you plan to do. A medical imaging dataset scores 92 for radiology classification but 61 for pathology segmentation. Same data, different score — because the system knows what matters for each task. 10 task profiles, auto-detected from metadata.
Innovation 2
Adversarial Robustness Profile
Instead of "this is 85/100," you get a degradation curve. At 5% label noise, how much does model performance drop? At 20%? This tells you whether the dataset is robust or fragile — before you spend compute discovering it the hard way.
Innovation 3
Evidence Triangulation
Every score comes from three independent sources: statistical analysis, real model training, and adversarial stress testing. When all three agree → high confidence. When they disagree → that disagreement is published as the most valuable signal.
LQS Verified by LabelSets

The trust mark is awarded only when all three evidence sources are available, confidence level is HIGH, and the composite score is ≥ 60. Like UL Listed for electronics or USDA Organic for food — it means the dataset passed a verification pipeline that predicts downstream performance, not just structural checks.

Empirical Evidence

Calibration study: per-dimension correlation with model performance

We trained 3-model ensemble baselines (XGBoost + Random Forest + Logistic Regression for tabular; TF-IDF LogReg for NLP) on 49 anchor datasets across 7 task types, measured held-out macro F1 via 5-fold cross-validation, and computed Spearman rank correlation between each LQS dimension and downstream performance. Three dimensions are statistically significant (p<0.05). The weights update automatically as more datasets flow through the system.
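The rank-correlation step can be sketched in pure Python (the dimension scores and F1 values below are illustrative, not the published calibration data; a production pipeline would typically call `scipy.stats.spearmanr` instead):

```python
def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.
    No tie handling -- fine for a sketch, not for production."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# One score per anchor dataset for one dimension, plus the matching macro F1.
annotation_consistency = [0.91, 0.72, 0.85, 0.60, 0.95]
macro_f1               = [0.93, 0.61, 0.88, 0.52, 0.96]
rho = spearman(annotation_consistency, macro_f1)   # ranks agree -> rho = 1.0
```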

Why this matters: Every other dataset quality system picks weights by intuition. We derived ours from data — and the system recalibrates itself daily. Dimensions that don't correlate with downstream F1 get downweighted to 1%. Dimensions that do correlate get the weight the data says they deserve. As more datasets are validated and more buyers submit training outcomes, the correlations tighten and the scores get smarter.
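The downweight-to-1% rule described above can be sketched as follows. The 1% floor comes from the text; the specific rho-to-weight mapping (weight proportional to non-negative rho) is an assumption for illustration:

```python
FLOOR = 0.01  # 1% floor for dimensions that don't correlate with F1

def calibrate_weights(rhos):
    """Weight each dimension in proportion to its (non-negative) Spearman
    rho against downstream F1, flooring non-correlating dimensions at ~1%.
    Illustrative mapping -- only the floor behavior is documented."""
    raw = {d: max(r, 0.0) for d, r in rhos.items()}      # negative rho -> 0
    total = sum(raw.values()) or 1.0
    w = {d: max(v / total, FLOOR) for d, v in raw.items()}
    norm = sum(w.values())
    return {d: v / norm for d, v in w.items()}            # renormalized to 1.0

weights = calibrate_weights({
    "annotation_consistency": 0.512,   # strongest predictor (table above)
    "label_error_estimate":   0.30,    # illustrative value
    "completeness":          -0.10,    # negative correlation -> floored
})
```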
Live Calibration
-- datasets -- categories -- buyer outcomes
Connecting to calibration engine

Per-dimension Spearman ρ vs. downstream macro F1

Key finding: Three dimensions now carry 82.3% of the total weight — all backed by statistically significant p-values. Annotation consistency (ρ=+0.512, p=0.005) is the strongest single predictor of downstream model performance. Previous "intuition weights" were wrong: completeness was weighted at 12% but has negative correlation with F1 when measured with ensemble baselines; label error estimate was weighted at 7% but is the third strongest predictor. These weights recalibrate automatically as more datasets flow through the system.

Calibration suite: 49 anchor datasets across 7 task types (multiclass classification, binary classification, text classification, legal AI, medical AI, financial AI, regression). Baseline models: 3-model ensemble (XGBoost + Random Forest + Logistic Regression) for tabular; TF-IDF Logistic Regression for NLP (DistilBERT is attempted first and falls back to TF-IDF on timeout). 5-fold stratified cross-validation with a fixed seed for reproducibility. The system recalibrates daily via a dual-signal feedback loop: buyer-reported training outcomes (the gold standard) and validation benchmark F1 from every dataset processed through the marketplace. Raw data is published at calibration/results/.
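The 5-fold stratified split can be sketched in pure Python (a real pipeline would typically use scikit-learn's `StratifiedKFold`; this only shows the splitting logic, not the ensemble training):

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=42):
    """Yield (train_idx, test_idx) pairs whose folds preserve per-class
    proportions: each class is shuffled, then dealt round-robin to folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    for t in range(k):
        test = sorted(folds[t])
        train = sorted(i for f in range(k) if f != t for i in folds[f])
        yield train, test

labels = ["pos"] * 10 + ["neg"] * 5
splits = list(stratified_kfold(labels, k=5))
# Every test fold holds 2 "pos" and 1 "neg" -- the data's own 2:1 ratio.
```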

Adversarial Robustness Profile (ARP)

How fragile is the data under noise?

We systematically flip 1%, 5%, 10%, and 20% of labels to random incorrect values, retrain the baseline model each time, and measure how much F1 degrades. The resulting degradation curve tells you whether the dataset is robust enough for production — before you spend compute discovering it isn't.
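The noise-injection loop above can be sketched as below; `train_and_f1` is an assumed placeholder standing in for the real baseline training run:

```python
import random

def flip_labels(labels, rate, classes, seed=0):
    """Flip `rate` of the labels to a random *incorrect* class,
    as in the ARP noise-injection step (sketch, not the real pipeline)."""
    rng = random.Random(seed)
    flipped = list(labels)
    for i in rng.sample(range(len(labels)), int(len(labels) * rate)):
        flipped[i] = rng.choice([c for c in classes if c != flipped[i]])
    return flipped

def degradation_curve(X, y, classes, train_and_f1,
                      rates=(0.01, 0.05, 0.10, 0.20)):
    """F1 drop at each noise rate, relative to training on clean labels."""
    clean_f1 = train_and_f1(X, y)
    return {r: clean_f1 - train_and_f1(X, flip_labels(y, r, classes, seed=1))
            for r in rates}
```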

Dataset | Domain | Clean F1 | Robustness | @1% noise | @5% | @10% | @20%
Medical Triage | Medical AI | 99.8% | 83/100 | 0.4% | 5.4% | 8.4% | 19.6%
Mushroom | Multiclass | 98.5% | 81/100 | 1.5% | 5.1% | 10.9% | 20.9%
Wine Recognition | Multiclass | 97.4% | 84/100 | 0.0% | 3.1% | 11.6% | 17.9%
Breast Cancer | Binary | 95.8% | 80/100 | 1.4% | 5.1% | 12.2% | 20.5%
Iris | Multiclass | 94.9% | 81/100 | 2.8% | 2.9% | 10.2% | 22.7%
Spambase | Binary | 93.7% | 80/100 | 1.2% | 5.1% | 11.4% | 22.5%
SMS Spam | Text Cls. | 91.4% | 72/100 | 2.0% | 9.0% | 16.0% | 29.7%
Ionosphere | Binary | 90.4% | 82/100 | 1.7% | 6.7% | 9.3% | 17.9%
AG News | Text Cls. | 88.8% | 81/100 | 1.1% | 5.2% | 10.3% | 20.5%
Financial Routing | Financial AI | 88.1% | 80/100 | 1.3% | 5.0% | 10.5% | 21.0%
Clinical Reasoning | Medical AI | 86.2% | 76/100 | 1.4% | 9.3% | 15.3% | 22.7%
Titanic | Binary | 80.5% | 84/100 | 1.0% | 4.0% | 9.4% | 17.4%
Legal Contracts | Legal AI | 59.3% | 84/100 | 0.3% | 2.9% | 6.3% | 23.0%
Legal Multi-Jurisdiction | Legal AI | 34.3% | 69/100 | 5.3% | 15.0% | 8.4% | 34.1%
Wine Quality | Multiclass | 34.0% | 74/100 | 7.6% | 6.9% | 13.0% | 25.0%
What the ARP reveals (49 datasets tested): Medical Triage achieves near-perfect F1 (99.8%) but degrades 19.6% at 20% noise — template-based data is easy to learn but noise-sensitive. The Multi-Jurisdiction Legal corpus degrades 34.1% at 20% noise — expected for fine-grained multi-class tasks. Titanic is remarkably robust (only 17.4% at 20%) — strong binary signal tolerates noise well. Every dataset in the marketplace gets this profile so buyers know fragility before they buy.

Robustness tiers: Robust (≥85) — degrades gracefully, production-ready. Moderate (65–84) — some noise sensitivity, label auditing recommended. Fragile (<65) — high sensitivity, requires quality control before training. Tested across 49 datasets with 3-model ensemble baselines and 4 noise injection rates.
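The tier thresholds above map directly to code; a minimal sketch:

```python
def robustness_tier(score: float) -> str:
    """ARP tier from a 0-100 robustness score (thresholds as published)."""
    if score >= 85:
        return "Robust"      # degrades gracefully, production-ready
    if score >= 65:
        return "Moderate"    # some noise sensitivity, audit labels
    return "Fragile"         # high sensitivity, QC before training

tier = robustness_tier(83)   # e.g. Medical Triage at 83/100 -> "Moderate"
```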

Task-Conditional Quality Prediction (TCQP)

Same dataset, different score — because tasks have different needs

A single quality number is misleading. Class balance, for example, is critical for fraud detection but nearly irrelevant for NER. LQS v3.0 reweights the 14 dimensions based on your intended task, producing a score that predicts performance for your specific use case.

Binary Classification 13 datasets
signal_strength ×3.0 · distribution_health ×3.0 · size_adequacy ×2.1 · annotation_consistency ×1.5 · class_distribution ×1.4
Multi-class Classification 17 datasets
completeness ×3.0 · uniqueness ×3.0 · schema_validity ×3.0 · signal_strength ×1.7
Text Classification 3 datasets
completeness ×3.0 · label_accuracy ×3.0 · schema_validity ×3.0 · distribution_health ×3.0 · diversity_score ×3.0
Legal AI 3 datasets
annotation_consistency ×2.0 · label_error_estimate ×1.8 · completeness ×3.0 · schema_validity ×3.0
Medical AI 3 datasets
annotation_consistency ×3.0 · label_accuracy ×2.0 · completeness ×1.8
Financial AI 3 datasets
class_distribution ×3.0 · signal_strength ×2.0 · uniqueness ×1.5
Regression 3 datasets
distribution_health ×3.0 · signal_strength ×3.0 · size_adequacy ×3.0 · label_error_estimate ×1.9 · class_distribution ×1.8
LLM Fine-Tuning expert-tuned
diversity_score ×2.5 · annotation_consistency ×2.0 · uniqueness ×2.0
Named Entity Recognition expert-tuned
label_density ×2.5 · annotation_consistency ×2.5 · completeness ×1.5
Object Detection expert-tuned
label_density ×2.5 · annotation_consistency ×2.0 · class_distribution ×1.5

How it works: Profiles marked with dataset counts are empirically derived — the multipliers come from per-task Spearman correlations computed against the calibration suite. "Expert-tuned" profiles use domain-knowledge multipliers and will be upgraded to empirical once enough datasets of that type flow through the marketplace. All multipliers are renormalized to sum to 1.0. Task profiles auto-detect from dataset category and format. The system recalibrates task multipliers daily as new datasets are validated.
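The multiply-then-renormalize rule can be sketched as below. The dimension names, base weights, scores, and profile multipliers are illustrative, not the production values:

```python
def task_conditional_score(dim_scores, base_weights, multipliers):
    """Apply a task profile's multipliers to the calibrated base weights,
    renormalize to sum to 1.0, and compute the weighted score."""
    w = {d: base_weights[d] * multipliers.get(d, 1.0) for d in base_weights}
    total = sum(w.values())
    w = {d: v / total for d, v in w.items()}          # renormalize
    return sum(dim_scores[d] * w[d] for d in w)

dims    = {"signal_strength": 90, "completeness": 60, "label_density": 40}
base    = {"signal_strength": 0.5, "completeness": 0.3, "label_density": 0.2}
binary  = {"signal_strength": 3.0}    # binary-classification-style profile
ner     = {"label_density": 2.5}      # NER-style profile

binary_score = task_conditional_score(dims, base, binary)   # ~80.5
ner_score    = task_conditional_score(dims, base, ner)      # ~63.8
# Same dataset, different scores under different task profiles.
```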

Evidence Triangulation

Three sources must agree before we call it "verified"

A quality score backed by one methodology is a claim. A score backed by three independent methodologies that agree is evidence. LQS v3.0 requires convergence across statistical analysis, real model training, and adversarial stress testing before awarding HIGH confidence.

Source | What it measures | How it's computed | What disagreement means
Statistical | Data properties: completeness, balance, consistency, schema | 14-dimension LQS score from direct file analysis | Statistical high but empirical low → data looks clean but doesn't train well (possible spurious correlations)
Empirical | Actual trainability: does a model learn from this data? | Real baseline model (Naive Bayes) trained on an 80% split; F1 measured on the held-out 20% | Empirical high but statistical low → data is messy but trainable (hidden signal in noise)
Adversarial | Robustness: how fragile is the training signal? | Label-flip perturbation at 1% / 5% / 10% / 20%; degradation curve | Statistical and empirical agree but adversarial says fragile → data trains well but may fail in production under noise
Confidence levels: HIGH = all three sources available and spread ≤ 20 points. MEDIUM = two sources or spread 20–40 points. LOW = one source only. The "LQS Verified" trust mark requires HIGH confidence — no shortcuts.
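The confidence rules map to a small function. One case is not specified in the text (all three sources available but spread above 40 points); the sketch below assumes it falls to LOW:

```python
def confidence(sources):
    """Confidence level from available evidence sources (values 0-100,
    None = unavailable) and their spread in points."""
    scores = [s for s in sources.values() if s is not None]
    if len(scores) <= 1:
        return "LOW"                      # one source only
    spread = max(scores) - min(scores)
    if len(scores) == 3 and spread <= 20:
        return "HIGH"                     # all three available and agree
    if spread <= 40:
        return "MEDIUM"                   # two sources, or spread 20-40
    return "LOW"                          # assumed: wide disagreement

level = confidence({"statistical": 82, "empirical": 75, "adversarial": 70})
```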
Formula

Composite score

The v3.0 composite blends all three evidence sources. Statistical scoring provides the base. Empirical training adjusts ±30%. Adversarial robustness adjusts ±15%. The final number is a prediction of downstream performance, not a checklist result.

// LQS v3.0 composite formula
blended     = statistical_composite × 0.70
            + empirical_f1 × 0.30          // if empirical evidence available
final_score = blended × 0.85
            + robustness × 0.15            // if ARP available

// statistical_composite = Σ(dim_score × calibrated_weight)
// calibrated_weight derived from Spearman ρ (see table above)
// empirical_f1 from real baseline model training
// robustness from adversarial label-flip degradation
// if a source is unavailable, its blending step is skipped
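Read as code, the composite becomes the following sketch; missing evidence simply leaves the corresponding blending step out:

```python
def composite(statistical, empirical_f1=None, robustness=None):
    """LQS v3.0 composite: statistical base, blended 70/30 with empirical
    F1 when available, then blended 85/15 with robustness when ARP ran."""
    score = statistical
    if empirical_f1 is not None:
        score = statistical * 0.70 + empirical_f1 * 0.30
    if robustness is not None:
        score = score * 0.85 + robustness * 0.15
    return score

s = composite(80, empirical_f1=90, robustness=70)   # ~81.05
```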
90–100 Platinum
75–89 Gold
60–74 Silver
0–59 Bronze
Format Support

12 format types

LQS v3.0 has format-specific analysis pipelines for every major ML data format. Each pipeline extracts the signals most meaningful for that format type.

Format | Type | Key signals
YOLO TXT | Vision / detection | annotations/image, coord validity, class frequency, bbox area distribution, density CV
COCO JSON | Vision / detection | image-annotation linking, category coverage, bbox validation
Pascal VOC XML | Vision / detection | XML parse rate, bbox validity, class names, density CV
KITTI TXT | Vision / 3D detection | 15-field format, 3D bbox parameters
LabelMe JSON | Vision / segmentation | polygon validity, class coverage
Image Folder | Vision / classification | class count (folder names), per-class sample count, balance
CSV / TSV | Tabular / NLP | null rate, schema consistency, label distribution, text length, vocab diversity
JSONL | Fine-tuning / NLP | parse errors, field coverage, response length, instruction diversity
Parquet | Tabular | columnar encoding, null rate, duplicate rate, schema
Arrow / Feather | Tabular | schema validation, null rate, type compliance
SQLite | Tabular | table structure, row count, null rate
HDF5 | Scientific / arrays | dataset shape, dtype validation, fill values
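A minimal sketch of the structural signals the CSV/TSV pipeline extracts (null rate, duplicate rate, schema consistency). The signal names mirror the table above; the implementation is illustrative, not the production pipeline:

```python
import csv
import io

def csv_signals(text: str) -> dict:
    """Compute basic CSV quality signals: fraction of empty cells,
    fraction of duplicate rows, and column-count consistency vs header."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    cells = [c for r in body for c in r]
    nulls = sum(1 for c in cells if c.strip() == "")
    dupes = len(body) - len({tuple(r) for r in body})
    return {
        "null_rate": nulls / max(len(cells), 1),
        "duplicate_rate": dupes / max(len(body), 1),
        "schema_consistent": all(len(r) == len(header) for r in body),
    }

sig = csv_signals("a,b\n1,2\n1,2\n3,\n")
# One empty cell out of six, one duplicate row out of three.
```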
Versioning

Version history

Version | Date | Changes
v3.0 | 2026-04-16 | Quality Intelligence Engine. Empirically calibrated weights from 49-dataset study with 3-model ensemble baselines (XGBoost + RF + LogReg). Three statistically significant dimensions (p<0.05). 7 empirical task profiles (binary, multiclass, text classification, legal AI, medical AI, financial AI, regression). Self-learning feedback loop: daily recalibration from buyer outcomes + validation benchmarks. Embedding-space analysis, multi-model agreement estimation, adversarial robustness profiling. 23 analysis modules. "LQS Verified" trust mark.
v2.0 | 2026-04-13 | 14 dimensions across 5 pillars. ML model runs for trainability scoring. Tier system (Platinum / Gold / Silver / Bronze). Legal domain augmentation (3 legal-specific dimensions).
v1.0 | 2026-04-07 | Initial 7-dimension system covering structural and annotation fundamentals.

Each dataset record stores the LQS version used to compute its scores. Datasets scored under earlier versions retain their original scores and are re-scored to v3.0 on next re-validation.

Get your dataset scored against LQS v3.0

Free quality audit →

No account required · Results in 60 seconds