The Trinity of Consistency as a Defining Principle for General World Models

Jingxuan Wei² Siyuan Li³ Yuhang Xu² Zheng Sun² Junjie Jiang² Hexuan Jin² Caijun Jia² Honghao He² Xinglong Xu² Xi Bai² Chang Yu³ Yumou Liu⁵ Junnan Zhu² Xuanhe Zhou⁵ Jintao Chen⁶ Xiaobin Hu⁴ Shancheng Pang⁷ Bihui Yu² Ran He² Zhen Lei² Stan Z. Li³ Conghui He¹ Shuicheng Yan⁴ Cheng Tan¹

¹ Shanghai Artificial Intelligence Laboratory ² University of Chinese Academy of Sciences ³ Westlake University ⁴ National University of Singapore ⁵ Shanghai Jiao Tong University ⁶ Zhejiang University ⁷ China University of Petroleum (East China)

The Trinity of Consistency: Modal, Spatial, and Temporal

Why Do We Need a General World Model?

The construction of World Models capable of learning, simulating, and reasoning about objective physical laws constitutes a foundational challenge in the pursuit of Artificial General Intelligence. Existing models often behave as naive physicists; we propose that a robust World Model must be grounded in the Trinity of Consistency.

Text ↔ Vision

Modal Consistency

The Interface

The ability to align heterogeneous information (text, image, tactile) into a unified semantic space, serving as the cognitive interface for instruction and feedback.

3D Geometry

Spatial Consistency

The Basis

The capacity to construct a 3D-aware representation that respects geometry, occlusion, and object permanence, ensuring the static plausibility of the simulated world.

Causality & Physics

Temporal Consistency

The Engine

The adherence to physical laws and causal logic over time, ensuring that dynamic evolution follows a predictable and logically sound trajectory.

Evolution of World Models

From loosely coupled specialized modules toward unified general architectures.

PHASE 1

Modal Evolution

Transitioning from geometric isolation to native unified models.

Dual-Tower MM-DiT

PHASE 2

Spatial Evolution

From implicit fields and primitives to generative priors.

NeRF, 3DGS

PHASE 3

Temporal Evolution

From autoregressive modeling to Diffusion Transformers.

Diffusion Transformers

CoW-Bench: The Ultimate Test

CoW-Bench (Consistency of World-models Benchmark) rigorously tests a model's ability to maintain the Trinity of Consistency under complex scenarios. It includes 1,485 meticulously constructed samples, organized into 18 fine-grained sub-tasks spanning the Modal, Spatial, and Temporal dimensions and their cross-axis synergies.

**Hierarchical Taxonomy of CoW-Bench.** The inner ring represents the main consistency dimensions (Modal, Space, Time), while the outer ring details the 18 fine-grained sub-tasks. The uniform sector sizes visually confirm the rigorously balanced distribution of the dataset.

Empirical Results & Dynamic Leaderboard

Comprehensive evaluation across 18 sub-tasks reveals the current state of World Models.

Interactive Performance Analysis

Overall Performance Comparison

Performance Comparison of Mainstream Models

Detailed Error Analysis (Heatmaps)

Modal Consistency

Performance distribution across multimodal alignment tasks.

Spatial Consistency

Spatial reasoning capabilities and local geometry.

Temporal Consistency

Temporal coherence and long-term object permanence.

"Heatmaps over sub-metrics reveal that while local geometry plausibility is strong, models struggle with explicit goal-conditioned state tracking (e.g., TS-Maze-2D) and semantic role binding."

CoW-Bench Leaderboard

AVG is rescaled from the original scale of [0, 10] to a percentage scale of [0, 100].

Leaderboard columns: Model, AVG, then three sub-task scores for each of six groups:

Modal: SUAT, LCED, MCON
Temporal: WLIN, SLEV, STOR
Spatial: SEPL, OCCO, MV3D
M-T: TREV, LOHO, ATDY
M-S: SEPL, OCCO, SEMV
T-S: 3DLO, OCMO, MAZE
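
The column grouping and the AVG rescaling described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's official scoring code: the names `TAXONOMY`, `columns`, and `rescaled_avg` are hypothetical, and treating AVG as the unweighted mean of the 18 sub-task scores is an assumption, since the page only states that AVG is rescaled from [0, 10] to [0, 100]. Because the codes SEPL and OCCO appear under both Spatial and M-S in the header, the sketch keys scores by (group, code) pairs rather than by code alone.

```python
from statistics import mean

# Sub-task codes copied from the leaderboard header, grouped as shown there.
TAXONOMY = {
    "Modal": ["SUAT", "LCED", "MCON"],
    "Temporal": ["WLIN", "SLEV", "STOR"],
    "Spatial": ["SEPL", "OCCO", "MV3D"],
    "M-T": ["TREV", "LOHO", "ATDY"],
    "M-S": ["SEPL", "OCCO", "SEMV"],
    "T-S": ["3DLO", "OCMO", "MAZE"],
}


def columns():
    """Return (group, code) pairs in leaderboard header order."""
    return [(group, code) for group, codes in TAXONOMY.items() for code in codes]


def rescaled_avg(scores):
    """Map per-sub-task scores on [0, 10] to one value on [0, 100].

    ASSUMPTION: AVG is the unweighted mean over all 18 sub-tasks.
    `scores` maps (group, code) pairs to floats.
    """
    cols = columns()
    missing = [c for c in cols if c not in scores]
    if missing:
        raise KeyError(f"missing sub-task scores: {missing}")
    values = [scores[c] for c in cols]
    if any(v < 0 or v > 10 for v in values):
        raise ValueError("sub-task scores must lie in [0, 10]")
    return 10.0 * mean(values)


if __name__ == "__main__":
    # Made-up scores purely for illustration; these are not benchmark results.
    demo = {col: 7.5 for col in columns()}
    print(len(columns()), rescaled_avg(demo))
```

If the official AVG weights sub-tasks by sample count or aggregates per group first, only the body of `rescaled_avg` would need to change; the (group, code) keying stays the same.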

Core Diagnostic Conclusion

"Models achieve strong local motion plausibility yet fail when tasks require a persistent, goal-directed world state. The constraint backoff phenomenon is widespread—models silently replace strict constraints with common defaults to maintain visual plausibility."

Challenges & Future Outlook

From Generative Models to General World Simulators

Differentiability of Physical Authenticity

Embedding Hamiltonians and conservation laws into differentiable operators.

Brittleness of Long-term Causal Chains

Resisting the butterfly effect in hour- to day-scale generation.

Paradigm Shift to Controllability

Upgrading prompts to APIs for editable online simulations.

Ultimate Outlook

"A Prompt-as-Action paradigm in which UMMs with modality consistency and video generation models with spatial–temporal consistency are unified. Equipped with an internal semantic compiler, such models can interpret high-dimensional natural-language prompts and translate them into universal spatiotemporal simulations that adhere to the Trinity of Consistency."

Citation

@misc{wei2026trinityconsistencydefiningprinciple,
      title={The Trinity of Consistency as a Defining Principle for General World Models}, 
      author={Jingxuan Wei and Siyuan Li and Yuhang Xu and Zheng Sun and Junjie Jiang and Hexuan Jin and Caijun Jia and Honghao He and Xinglong Xu and Xi Bai and Chang Yu and Yumou Liu and Junnan Zhu and Xuanhe Zhou and Jintao Chen and Xiaobin Hu and Shancheng Pang and Bihui Yu and Ran He and Zhen Lei and Stan Z. Li and Conghui He and Shuicheng Yan and Cheng Tan},
      year={2026},
      eprint={2602.23152},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.23152}, 
}

The code and dataset are released under the Apache 2.0 License.
