Programming with Data
ProDa closes the loop in LLM data engineering by treating training data as source code, model training as compilation, benchmarking as unit testing, and failure-driven data repair as debugging.
Zhejiang University; University of Chinese Academy of Sciences; Shanghai Artificial Intelligence Laboratory
Core idea
Instead of adding more data blindly when a model fails, ProDa traces benchmark failures to concept-level and reasoning-chain deficits, then generates targeted data patches grounded in the original corpus.
Raw corpus as specification: the raw domain corpus defines the knowledge scope and constraints the model should satisfy.
SFT data as source code: synthesized SFT data explicitly encodes concepts, relations, and reasoning patterns.
Fine-tuning as compilation: fine-tuning compiles human-readable training data into executable model weights.
Benchmarking as unit testing: ProDa-16 tests whether the compiled model implements the corpus-derived specification.
Repair as debugging: failures are traced through the shared knowledge structure and repaired with targeted data patches.
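The compile-test-debug analogy above can be sketched as a small runnable loop. This is a toy simulation under stated assumptions, not ProDa's actual API: "training" is modeled as a set of mastered concepts, a benchmark item passes iff every concept it tests is mastered, and the helper names (run_benchmark, trace_failures, repair_loop) are illustrative.

```python
# Toy sketch of ProDa's test-driven loop (hypothetical names, not the
# paper's API). Mastery is a set of concept ids; a benchmark item passes
# iff every concept it tests is in that set.

def run_benchmark(mastered, benchmark):
    """Unit testing: return the benchmark items the model fails."""
    return [item for item in benchmark if not set(item["concepts"]) <= mastered]

def trace_failures(failures):
    """Debugging: map failed items back to the responsible knowledge nodes."""
    deficits = set()
    for item in failures:
        deficits |= set(item["concepts"])
    return deficits

def repair_loop(mastered, benchmark, max_rounds=3):
    """Patch concept-level deficits until the benchmark passes."""
    for _ in range(max_rounds):
        failures = run_benchmark(mastered, benchmark)
        if not failures:
            break
        # Targeted data patch: cover only the deficient concepts,
        # instead of blindly adding more data.
        mastered |= trace_failures(failures)
    return mastered

benchmark = [
    {"id": "phys-01", "concepts": ["fresnel_zone", "intensity"]},
    {"id": "med-07", "concepts": ["sodium_channel_inactivation"]},
]
model = repair_loop({"intensity"}, benchmark)
print(len(run_benchmark(model, benchmark)))  # prints 0: all items pass after repair
```

The point of the sketch is the control flow: evaluation output feeds directly back into data generation, rather than triggering an undirected round of extra data.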
Framework
ProDa connects evaluation and data generation through one shared structured knowledge representation.
Extracts structured knowledge from raw corpora and compiles it into rigorous benchmarks before training begins.
Synthesizes training data from the same knowledge structure, separating knowledge content from reasoning structure.
Treats benchmark failures as runtime errors, locates responsible knowledge nodes, and produces targeted patches.
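Because benchmark items and training samples are generated from the same knowledge structure, a failure can be localized by comparing the nodes a failed item tests against the nodes the training data actually covers. The schema below is a hypothetical sketch of that linkage (field names and the coverage threshold are assumptions, not ProDa's format):

```python
# Hypothetical schema: benchmark items and training samples both reference
# shared knowledge-node ids (concepts, relations, reasoning chains), so a
# failed item can be traced to under-covered nodes.

def nodes_to_patch(failed_items, training_samples, min_coverage=2):
    """Return node ids implicated in failures whose training coverage is thin."""
    coverage = {}
    for sample in training_samples:
        for node in sample["nodes"]:
            coverage[node] = coverage.get(node, 0) + 1
    deficits = {node for item in failed_items for node in item["nodes"]}
    return sorted(n for n in deficits if coverage.get(n, 0) < min_coverage)

training = [
    {"id": "sft-1", "nodes": ["concept:fresnel_zone"]},
    {"id": "sft-2", "nodes": ["concept:intensity"]},
    {"id": "sft-3", "nodes": ["concept:intensity"]},
]
failed = [{"id": "bench-9", "nodes": ["concept:fresnel_zone", "concept:intensity"]}]
print(nodes_to_patch(failed, training))  # prints ['concept:fresnel_zone']
```

The design choice this illustrates: the repair target is a knowledge node, not a benchmark item, so one patch can fix every item that depends on the same deficit.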
ProDalib
ProDalib packages curated corpus chunks, 227k key concepts, 16k benchmark items, and 160k synthesized training samples across 16 disciplines.
ProDa-16 Benchmark
ProDa-16 spans 16 disciplines across natural sciences, engineering, biomedicine, and social and professional sciences. Because it is generated from the same knowledge structure as training data, failures can be traced back to specific concepts, relations, or reasoning chains.
Figure 3a of the paper reports per-discipline correlation values; the full benchmark-validation figure contains panels a-c. The chart compares Instruct references, ProDa V1, ProDa V2, and the V2-V1 gains. Discipline codes: 001 Physics, 002 Engineering, 003 Medicine, 004 Mathematics, 005 Computer Science, 006 Biology, 007 Chemistry, 008 Earth Science, 009 Materials Science, 010 Education, 011 Economics, 012 History, 013 Environmental Science, 014 Sociology, 015 Psychology, 016 Astronomy.
Results
A first synthesis round produces competitive domain competence, while diagnostic repair closes model-specific gaps with targeted data patches.
ProDa-16 average accuracy, taken from the paper's full experimental-results table.
| Model | ProDa V1 | ProDa V2 | Gain (pts) | Notes |
|---|---|---|---|---|
| Llama-3.1-8B | 30.35% | 63.02% | +32.67 | largest reported gain |
| Qwen-2.5-3B | 50.42% | 67.87% | +17.45 | substantial repair effect |
| Qwen-2.5-7B | 65.86% | 70.79% | +4.93 | used in scaling comparison |
| Qwen-2.5-14B | 70.12% | 76.30% | +6.18 | mid-scale gain |
| Qwen-2.5-32B | 76.54% | 78.84% | +2.30 | high absolute score |
| Qwen-3-14B | 76.44% | 77.21% | +0.77 | smaller gain from stronger base |
| Qwen-3-32B | 77.35% | 79.52% | +2.17 | highest reported V2 score |
Diagnostic-driven targeted data is more effective than blindly scaling synthetic data: in the reported Qwen2.5-7B comparison, ProDa V2 reaches 68.72% with only 1K targeted repair samples and peaks at 72.11% at the 5K scale.
Case studies
Physics
Diagnoses a concept gap around the uncancelled Fresnel zone and repairs the intensity reasoning.
Economics and Law
Separates proposals from panel findings and reconstructs the relevant SPS Agreement reasoning chain.
Medicine
Repairs the biophysical mechanism behind sustained depolarization and sodium-channel inactivation.
ProDa Studio
ProDa Studio organizes the workflow from knowledge extraction to benchmark generation, fine-tuning data generation, model training, evaluation, and the next diagnostic cycle.
Extract Knowledge Core from L3 chains, L2 statements, and L1 concepts.
Generate benchmark and fine-tuning instances with linked knowledge nodes.
Track fine-tuning loss, learning-rate curves, and per-discipline evaluation.
Start the next diagnostic repair cycle from observed failures.
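The Studio workflow can be read as a pipeline whose evaluation stage decides whether another diagnostic cycle starts. A minimal sketch under stated assumptions follows; the stage functions, the string-based "corpus", and the answer dictionary are all illustrative stand-ins, not ProDa Studio's interface:

```python
# Toy pipeline sketch (illustrative stage names, not ProDa Studio's API):
# knowledge extraction -> instance generation -> evaluation -> next cycle.

def extract_knowledge(corpus):
    # In the real system this would build L1 concepts, L2 statements, and
    # L3 chains; here we just collect distinct concept tokens.
    return {"concepts": sorted(set(corpus.split()))}

def generate_instances(knowledge):
    # Each generated item stays linked to its knowledge node.
    return [{"question": c, "node": c} for c in knowledge["concepts"]]

def evaluate(model_answers, benchmark):
    """Return the items the model fails; these seed the next diagnostic cycle."""
    return [item for item in benchmark if model_answers.get(item["node"]) != "ok"]

corpus = "fresnel_zone intensity sodium_channel"
bench = generate_instances(extract_knowledge(corpus))
answers = {"intensity": "ok", "sodium_channel": "ok"}  # model misses one concept
failures = evaluate(answers, bench)
print([f["node"] for f in failures])  # prints ['fresnel_zone']
```

Because every instance carries its knowledge-node link from generation onward, the evaluation output is already in the form the next repair cycle consumes.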
Draft citation. Verify final author list, venue, arXiv identifier, and URL before publishing.
@misc{pan2026programmingdatatestdrivendata,
title={Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora},
author={Chenkai Pan and Xinglong Xu and Yuhang Xu and Yujun Wu and Siyuan Li and Jintao Chen and Conghui He and Jingxuan Wei and Cheng Tan},
year={2026},
eprint={2604.24819},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2604.24819}
}