ProDa


Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora

ProDa closes the loop in LLM data engineering by treating training data as source code, model training as compilation, benchmarking as unit testing, and failure-driven data repair as debugging.

Chenkai Pan Xinglong Xu Yuhang Xu Yujun Wu Siyuan Li Jintao Chen Conghui He Jingxuan Wei Cheng Tan

Zhejiang University; University of Chinese Academy of Sciences; Shanghai Artificial Intelligence Laboratory

117k raw documents
15B raw tokens
227k key concepts
16k benchmark items
160k SFT samples
16 disciplines

Core idea

Data engineering as a test-driven programming loop

Instead of adding more data blindly when a model fails, ProDa traces benchmark failures to concept-level and reasoning-chain deficits, then generates targeted data patches grounded in the original corpus.

Raw Corpus as Specification

The raw domain corpus defines the knowledge scope and constraints the model should satisfy.

Training Data as Source Code

Synthesized SFT data explicitly encodes concepts, relations, and reasoning patterns.

Training as Compilation

Fine-tuning compiles human-readable training data into executable model weights.

Benchmarking as Unit Testing

ProDa-16 tests whether the compiled model implements the corpus-derived specification.

Data Debugging as Bug Fixing

Failures are traced through the shared knowledge structure and repaired with targeted data patches.
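The five analogies above describe an ordinary test-driven development loop. A toy, self-contained sketch of that loop, where every name and data structure is an illustrative stand-in rather than ProDa's actual API: the "model" is just a set of absorbed facts, and a benchmark item passes when its grounding fact is known.

```python
def finetune(model, data):
    """Training as compilation: the model absorbs the training data."""
    return model | set(data)

def run_benchmark(model, benchmark):
    """Benchmarking as unit testing: list the concepts the model fails."""
    return [c for c, fact in benchmark.items() if fact not in model]

def generate_patches(failures, corpus):
    """Data debugging: emit targeted data only for the failed concepts."""
    return [corpus[c] for c in failures]

# Toy corpus: concept -> grounding fact (illustrative, not ProDa's schema).
corpus = {"fresnel_zone": "fact-a", "sps_agreement": "fact-b", "ecg": "fact-c"}
benchmark = dict(corpus)                     # tests derived from the same corpus

model = finetune(set(), ["fact-a"])          # V1: partial coverage
failures = run_benchmark(model, benchmark)   # -> ['sps_agreement', 'ecg']
model = finetune(model, generate_patches(failures, corpus))  # V2: targeted repair
assert run_benchmark(model, benchmark) == []
```

The point of the sketch is the last two lines: instead of adding more data blindly, the patch is generated only for the concepts that failed.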

Framework

Tester, Builder, Debugger

ProDa connects evaluation and data generation through one shared structured knowledge representation.

Tester

Extracts structured knowledge from raw corpora and compiles it into rigorous benchmarks before training begins.

Builder

Synthesizes training data from the same knowledge structure, separating knowledge content from reasoning structure.

Debugger

Treats benchmark failures as runtime errors, locates responsible knowledge nodes, and produces targeted patches.

ProDalib

A traceable data suite built from raw educational corpora

ProDalib packages curated corpus chunks, 227k key concepts, 16k benchmark items, and 160k synthesized training samples across 16 disciplines.

48k high-quality chunks
43,953 L3 reasoning chains
186,784 L2 relational statements
227,869 L1 atomic concepts
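The three layers suggest a simple linked representation in which chains reference relations and relations reference concepts. A minimal sketch with hypothetical field names (illustrative only, not the released ProDalib schema):

```python
from dataclasses import dataclass

# Hypothetical schema for the layered knowledge structure; field names
# and the example chunk id are illustrative, not ProDalib's data format.

@dataclass
class Concept:             # L1: atomic concept grounded in a corpus chunk
    name: str
    chunk_id: str

@dataclass
class Relation:            # L2: relational statement over L1 concepts
    statement: str
    concepts: list[str]    # names of the L1 concepts it connects

@dataclass
class ReasoningChain:      # L3: multi-step chain built from L2 relations
    steps: list[str]       # ordered relation statements
    conclusion: str

zone = Concept("Fresnel half-wave zone", chunk_id="chunk-0000")
rel = Relation("Adjacent zones contribute with opposite phase", [zone.name])
chain = ReasoningChain(
    steps=[rel.statement],
    conclusion="The uncancelled first zone sets the on-axis intensity",
)
```

Because benchmark items and SFT samples both link to nodes in this structure, a benchmark failure can be traced back to the exact concepts, relations, or chains involved.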

ProDa-16 Benchmark

Executable tests for the data-debugging loop

ProDa-16 spans 16 disciplines across natural sciences, engineering, biomedicine, and social and professional sciences. Because it is generated from the same knowledge structure as training data, failures can be traced back to specific concepts, relations, or reasoning chains.

0.847 mean Spearman rank correlation
0.943 GPQA correlation
0.905 MMLU-Pro correlation

Correlation with established benchmarks

A compact web chart of the paper's Figure 3a correlation values; the full benchmark validation figure contains panels a–c.
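The Spearman figures compare how ProDa-16 ranks a set of models against how established benchmarks rank them. A pure-Python refresher on the computation (the accuracy lists below are made up for illustration, not the paper's data):

```python
def rank(values):
    """Rank scores from 1 (lowest) upward; ties get the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                     # extend over a run of tied values
        avg = (i + j) / 2 + 1          # average of the tied positions, 1-based
        for k in order[i:j + 1]:
            ranks[k] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rho = Pearson correlation of the two rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Made-up accuracies for five models on two benchmarks (not paper data):
proda16 = [30.4, 50.4, 65.9, 70.1, 76.5]
mmlu_pro = [28.0, 47.5, 60.2, 66.8, 71.3]
print(round(spearman(proda16, mmlu_pro), 3))  # identical rankings -> 1.0
```

A rho near 1 means ProDa-16 orders models almost exactly as the established benchmark does, even when absolute scores differ.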

Table 1. Full benchmark results across all 16 disciplines

Instruct references, ProDa V1, ProDa V2, and V2-V1 gains. Discipline codes: 001 Physics, 002 Engineering, 003 Medicine, 004 Mathematics, 005 Computer Science, 006 Biology, 007 Chemistry, 008 Earth Science, 009 Materials Science, 010 Education, 011 Economics, 012 History, 013 Environmental Science, 014 Sociology, 015 Psychology, 016 Astronomy.

Results

Diagnostic repair beats blind scaling

A first synthesis round produces competitive domain competence, while diagnostic repair closes model-specific gaps with targeted data patches.

Reported V1 to V2 gains

ProDa-16 average accuracy, taken from the paper's full experimental-results table.

Model        | V1     | V2     | Gain   | Notes
Llama-3.1-8B | 30.35% | 63.02% | +32.67 | largest reported gain
Qwen-2.5-3B  | 50.42% | 67.87% | +17.45 | substantial repair effect
Qwen-2.5-7B  | 65.86% | 70.79% | +4.93  | used in scaling comparison
Qwen-2.5-14B | 70.12% | 76.30% | +6.18  | mid-scale gain
Qwen-2.5-32B | 76.54% | 78.84% | +2.30  | high absolute score
Qwen-3-14B   | 76.44% | 77.21% | +0.77  | smaller gain from stronger base
Qwen-3-32B   | 77.35% | 79.52% | +2.17  | highest reported V2 score

Key takeaway

Diagnostic-driven targeted data is more effective than blindly scaling synthetic data: in the reported Qwen-2.5-7B comparison, ProDa V2 reaches 68.72% with 1K targeted repair samples and peaks at 72.11% at the 5K scale.

Case studies

Failures become traceable repair tasks

Physics

Fresnel half-wave zone method

Diagnoses a concept gap around the uncancelled Fresnel zone and repairs the intensity reasoning.

Economics and Law

WTO Japan Varietals

Separates proposals from panel findings and reconstructs the relevant SPS Agreement reasoning chain.

Medicine

Hyperkalemia ECG manifestations

Repairs the biophysical mechanism behind sustained depolarization and sodium-channel inactivation.

ProDa Studio

An integrated environment for applying ProDa to new corpora

ProDa Studio organizes the workflow from knowledge extraction to benchmark generation, fine-tuning data generation, model training, evaluation, and the next diagnostic cycle.

Extract Knowledge Core from L3 chains, L2 statements, and L1 concepts.

Generate benchmark and fine-tuning instances with linked knowledge nodes.

Track fine-tuning loss, learning-rate curves, and per-discipline evaluation.

Start the next diagnostic repair cycle from observed failures.

Citation

Draft citation. Verify final author list, venue, arXiv identifier, and URL before publishing.

@misc{pan2026programmingdatatestdrivendata,
  title={Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora},
  author={Chenkai Pan and Xinglong Xu and Yuhang Xu and Yujun Wu and Siyuan Li and Jintao Chen and Conghui He and Jingxuan Wei and Cheng Tan},
  year={2026},
  eprint={2604.24819},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2604.24819}
}