Interleaved text-image reasoning
MoTiF supervises the transition points where text reasoning becomes visual generation and generated images become text observations. The framework combines Reflective SFT with Flow-GRPO to reduce cross-modal hallucination without relying on final-answer rewards.
Failure mode
Long-chain multimodal reasoning can fail even when the model keeps producing both text and images. The generated image can drift away from the instruction, and the following text can ignore visible contradictions.
The text step encodes an intended world state, but the image step renders a different state. MoTiF treats this as information loss from understanding to generation.
The next text step observes an image but fails to decode the evidence it actually contains. MoTiF directly trains the text side to detect and recover from this mismatch.
MoTiF framework
Each training signal is attached to a single modality transition. This keeps the optimization aligned with the structural failure instead of rewarding lucky final answers.
A rubric-based VLM judge scores whether generated images match textual instructions and whether text correctly observes generated images.
Corrupted intermediate images force the model to identify visual errors, ignore or redraw them, and continue reasoning from the intended state.
Image generation is optimized with transition rewards so visual outputs better preserve the state specified by the preceding text instruction.
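As a rough illustration of attaching one score to each modality transition, the sketch below scores every step of a chain twice: once for text to image and once for image to text. The names `Step`, `score_transitions`, `toy_u2g`, and `toy_g2u` are hypothetical, and the judges are exact-match stand-ins for the rubric-based VLM judge, whose actual interface is not given here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    instruction: str   # text describing the intended world state
    image_state: str   # stand-in for the rendered image (a real system holds pixels)
    observation: str   # text written after looking at the image


def score_transitions(
    chain: List[Step],
    judge_u2g: Callable[[str, str], float],
    judge_g2u: Callable[[str, str], float],
) -> List[Dict[str, float]]:
    """Return one reward pair per step instead of a single final-answer reward."""
    rewards = []
    for i, step in enumerate(chain):
        rewards.append({
            "step": i,
            # understanding -> generation: does the image match the instruction?
            "r_u2g": judge_u2g(step.instruction, step.image_state),
            # generation -> understanding: does the text read the image correctly?
            "r_g2u": judge_g2u(step.image_state, step.observation),
        })
    return rewards


# Toy judges (assumption): exact string match instead of a VLM rubric.
def toy_u2g(instruction: str, image_state: str) -> float:
    return 1.0 if image_state == instruction else 0.0


def toy_g2u(image_state: str, observation: str) -> float:
    return 1.0 if observation == image_state else 0.0


chain = [
    Step("box at (1,2)", "box at (1,2)", "box at (1,2)"),
    # The image drifts from the instruction, and the text repeats the plan
    # instead of reading the image, so both transitions fail at this step.
    Step("box at (1,3)", "box at (2,3)", "box at (1,3)"),
]
rewards = score_transitions(chain, toy_u2g, toy_g2u)
```

In this toy chain the second step fails both checks even though the final text still states the planned position, which is the kind of case a final-answer reward could miss.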
Replace one intermediate image with a corrupted variant so the next text step has to respond to visual evidence, not only the previous plan.
The rewritten text first observes what is actually visible, then explains why that image conflicts with the intended drawing instruction.
Training examples use detect-and-ignore or detect-and-redraw patterns, keeping the reasoning chain aligned with the correct world state.
Rubric-based judges retain samples where the text correctly uses visual information and produces an executable next drawing plan.
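The four steps above (corrupt, observe and explain, recover, filter) can be sketched as a small data-construction routine. This is a minimal sketch under stated assumptions: `build_reflective_sample`, `filter_samples`, the dictionary fields, and the string-based "images" are all hypothetical placeholders, and the judge is a toy predicate, not the rubric-based VLM judge.

```python
import random
from typing import Callable, Dict, List


def build_reflective_sample(
    chain: List[Dict[str, str]],
    corrupt: Callable[[str], str],
    rng: random.Random,
    mode: str = "redraw",
) -> Dict:
    """Corrupt one intermediate image and attach a reflective text response."""
    if mode not in ("ignore", "redraw"):
        raise ValueError("mode must be 'ignore' or 'redraw'")
    idx = rng.randrange(len(chain))          # which intermediate image to corrupt
    intended = chain[idx]
    bad_image = corrupt(intended["image"])
    # Observe what is visible first, then explain the conflict with the plan.
    reflection = (
        f"I see: {bad_image}. "
        f"This conflicts with the instruction: {intended['instruction']}."
    )
    if mode == "redraw":
        recovery = f"Redraw the image so that {intended['instruction']}."
    else:
        recovery = f"Ignore the image and continue from: {intended['instruction']}."
    steps = [dict(s) for s in chain]         # copy so the clean chain is untouched
    steps[idx] = {**intended, "image": bad_image,
                  "reflection": reflection, "recovery": recovery}
    return {"corrupted_index": idx, "mode": mode, "steps": steps}


def filter_samples(samples: List[Dict], judge: Callable[[Dict], bool]) -> List[Dict]:
    """Keep only the samples the judge accepts."""
    return [s for s in samples if judge(s)]


chain = [
    {"instruction": "the ball is left of the wall", "image": "ball-left"},
    {"instruction": "the ball is right of the wall", "image": "ball-right"},
]
sample = build_reflective_sample(chain, lambda img: img + "-corrupted", random.Random(0))
kept = filter_samples(
    [sample],
    lambda s: "conflicts" in s["steps"][s["corrupted_index"]]["reflection"],
)
```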
End-task rewards can accept chains that arrive at the right answer through inconsistent intermediate states. MoTiF supervises the local boundary where information is lost.
Benchmarks and data
The suite spans planning, navigation, object manipulation, and physical reflection. The tables below list each task's reference benchmark and size, followed by the RL and SFT set sizes.
Sokoban: planning-heavy box-pushing tasks where understanding carries much of the reasoning load.
Maze: long-chain navigation where text plans must remain consistent with rendered paths.
Multi-hop Manipulation: object removal, insertion, attribute changes, and spatial updates over rendered scenes.
Ball Tracking: reflection trajectories where visual world modeling provides critical physical intuition.
| Task | Reference | Size |
|---|---|---|
| Sokoban | Game-RL | 7,997 |
| Maze | Game-RL | 7,365 |
| Multi-hop Manipulation | CLEVER | 8,000 |
| Ball Tracking | RBench-V | 8,000 |
| Task | Size-RL | Size-SFT |
|---|---|---|
| Sokoban | 6,074 | 15,994 |
| Maze | 3,945 | 14,730 |
| Manipulation | 4,034 | 16,000 |
| Ball Tracking | 4,615 | 16,000 |
Main results
The benchmark table covers frontier models, open-source baselines, both MoTiF stages, and the maximum gain over the base model.
| Model | Sokoban | Maze | Manipulation | Ball Tracking | Overall |
|---|---|---|---|---|---|
| Frontier Models | | | | | |
| Gemini3.5-Flash | 85.71 | 97.47 | 93.33 | 77.33 | 88.46 |
| Gemini3.1-Flash-Lite | 37.30 | 11.39 | 88.00 | 40.00 | 44.17 |
| Seed2.0-Lite-0215 | 69.05 | 64.56 | 89.33 | 58.67 | 70.40 |
| Open-Source Models | | | | | |
| Qwen3.5-27B | 80.95 | 16.46 | 86.67 | 32.00 | 54.02 |
| Gemma-4-31B | 66.67 | 13.92 | 90.67 | 32.00 | 50.82 |
| Bagel-7B-MoT | 16.67 | 10.13 | 45.33 | 29.33 | 25.37 |
| ThinkMorph | 18.25 | 8.86 | 69.33 | 48.00 | 36.11 |
| w/o interleaved thinking | 22.22 | 5.06 | 65.33 | 37.33 | 31.74 |
| Our Models | | | | | |
| Ours (Stage1: optimize L_gen-to-und) | 43.65 | 67.09 | 87.67 | 70.33 | 67.19 |
| Ours (Stage2: optimize L_und-to-gen) | 50.00 | 70.57 | 90.67 | 72.67 | 70.98 |
| Delta max (vs. BaseModel) | +33.33 | +60.44 | +45.34 | +43.34 | +45.61 |
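The aggregate columns can be checked by hand. The Overall values are consistent with an unweighted mean of the four task scores, and the Delta max row matches the better of the two MoTiF stages minus the Bagel-7B-MoT row. That Bagel-7B-MoT is the base model is inferred from this arithmetic, not stated in the table. A short sketch of the check, with variable names chosen here for illustration:

```python
# Scores copied from the results table, in the order
# Sokoban, Maze, Manipulation, Ball Tracking.
base = [16.67, 10.13, 45.33, 29.33]     # Bagel-7B-MoT row (assumed base model)
stage1 = [43.65, 67.09, 87.67, 70.33]   # Ours, Stage 1
stage2 = [50.00, 70.57, 90.67, 72.67]   # Ours, Stage 2


def mean(values):
    """Unweighted mean across tasks."""
    return sum(values) / len(values)


overall_stage1 = mean(stage1)
overall_stage2 = mean(stage2)

# Per-task maximum gain: best MoTiF stage minus the base score.
delta_max = [max(s1, s2) - b for s1, s2, b in zip(stage1, stage2, base)]

# Gain on the Overall column, computed from the means.
delta_overall = max(overall_stage1, overall_stage2) - mean(base)
```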
Training analysis
The paper reports the transition rewards before and after Flow-GRPO, then lists training hyperparameters for both stages.
| Task | R_g2u (Pre) | R_g2u (Post) | R_u2g (Pre) | R_u2g (Post) |
|---|---|---|---|---|
| Sokoban | 52.80 | 51.20 | 56.00 | 60.80 |
| Maze | 64.00 | 65.33 | 80.00 | 88.00 |
| Manipulation | 82.67 | 81.33 | 49.33 | 58.67 |
| Ball Tracking | 62.67 | 60.00 | 73.33 | 76.00 |
| Average | 65.53 | 64.47 | 64.66 | 70.87 |
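Reading the reward table as differences makes the pattern easier to see. The sketch below subtracts the pre value from the post value for each reward; the values are copied from the table, and the dictionary layout and function name are chosen here for illustration.

```python
# (R_g2u pre, R_g2u post, R_u2g pre, R_u2g post), copied from the table.
rewards = {
    "Sokoban": (52.80, 51.20, 56.00, 60.80),
    "Maze": (64.00, 65.33, 80.00, 88.00),
    "Manipulation": (82.67, 81.33, 49.33, 58.67),
    "Ball Tracking": (62.67, 60.00, 73.33, 76.00),
    "Average": (65.53, 64.47, 64.66, 70.87),
}


def reward_changes(row):
    """Return (change in R_g2u, change in R_u2g), rounded to two decimals."""
    g_pre, g_post, u_pre, u_post = row
    return round(g_post - g_pre, 2), round(u_post - u_pre, 2)


changes = {task: reward_changes(row) for task, row in rewards.items()}

# Tasks where the text-to-image reward improved after Flow-GRPO.
u2g_improved = [task for task, (_, du) in changes.items() if du > 0]
```

By this arithmetic R_u2g rises on every task (about 6.2 points on average), while R_g2u stays within roughly 3 points of its pre-training value.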
| Parameter | Value |
|---|---|
| expected_num_tokens | 34560 |
| max_num_tokens | 34560 |
| max_num_tokens_per_sample | 17280 |
| prefer_buffer_before | 17280 |
| total_steps | 3020 |
| warmup_steps | 150 |
| save_every | 300 |
| lr_scheduler | cosine |
| mse_weight | 10 |
| ce_weight | 1 |
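To make the schedule and loss weights in this table concrete, here is a minimal sketch of a cosine schedule with warmup and a weighted loss sum. Several details are assumptions and not taken from the table: the warmup is taken to be linear, `peak_lr` and `min_lr` are hypothetical parameters (no learning rate is listed for this stage), and the MSE and cross-entropy terms are assumed to be the image-side and text-side losses respectively.

```python
import math

# Values from the hyperparameter table.
TOTAL_STEPS = 3020
WARMUP_STEPS = 150
MSE_WEIGHT = 10.0
CE_WEIGHT = 1.0


def lr_at(step: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Learning rate at a step: linear warmup (assumed), then cosine decay."""
    if step < WARMUP_STEPS:
        return peak_lr * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def combined_loss(mse: float, ce: float) -> float:
    """Weighted sum of the two loss terms using the table's weights."""
    return MSE_WEIGHT * mse + CE_WEIGHT * ce


# Example: a hypothetical peak learning rate, sampled every 300 steps.
schedule = [lr_at(s, peak_lr=1.0) for s in range(0, TOTAL_STEPS + 1, 300)]
```

With these weights a unit of MSE counts ten times as much as a unit of cross-entropy in the combined objective.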
| Parameter | Value |
|---|---|
| judge_model | qwen3.5_27b |
| train_batch_size | 6 |
| num_image_per_prompt | 16 |
| num_batches_per_epoch | 1 |
| learning_rate | 5e-6 |
| beta | 0.04 |
| sde_window_size | 3 |
| seed | 42 |
| noise_level | 1.3 |
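GRPO-style methods compare samples drawn for the same prompt and normalize each reward against its group. With `num_image_per_prompt` set to 16, each group would hold 16 images. The sketch below shows only that group normalization. It does not implement the SDE sampling or policy update of Flow-GRPO, and reading `beta` as a KL-penalty coefficient is an assumption. The function name and `eps` value are illustrative.

```python
from statistics import mean, pstdev
from typing import List

NUM_IMAGE_PER_PROMPT = 16  # group size, from the table


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group: half of the generated images pass the transition judge.
group_rewards = [1.0] * 8 + [0.0] * 8
advantages = group_advantages(group_rewards)

# A group with identical rewards carries no learning signal.
flat = group_advantages([0.5] * NUM_IMAGE_PER_PROMPT)
```

Because the advantage is relative to the group, an image is rewarded for matching the instruction better than its siblings, not for an absolute score.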
Qualitative traces
The examples emphasize why transition fidelity matters: the visual state must be both rendered faithfully and read back into the next text step.
The maze state checks whether the chain preserves path geometry, walls, and start-to-goal constraints after an intermediate drawing step.
Modal isolation occurs when the next text step validates a visually invalid path instead of reading the evidence from the generated image.
A grounded chain must carry the same task-relevant world state across text, image, and the following observation before advancing.
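For the maze case, the constraints named above (path geometry, walls, start and goal) can be written as an explicit check. This is an illustrative sketch only: the grid encoding with `#` for walls, the coordinate convention, and the function name are assumptions, and a real pipeline would have to extract the path from a generated image first.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (row, column)


def path_is_valid(grid: List[str], path: List[Cell], start: Cell, goal: Cell) -> bool:
    """Check start/goal, wall collisions, and step-by-step adjacency."""
    if not path or path[0] != start or path[-1] != goal:
        return False
    rows, cols = len(grid), len(grid[0])
    for r, c in path:
        if not (0 <= r < rows and 0 <= c < cols):
            return False              # left the maze
        if grid[r][c] == "#":
            return False              # walked through a wall
    for (r1, c1), (r2, c2) in zip(path, path[1:]):
        if abs(r1 - r2) + abs(c1 - c2) != 1:
            return False              # jumped or moved diagonally
    return True


grid = [
    "S.#",
    ".#.",
    "..G",
]
start, goal = (0, 0), (2, 2)

good_path = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
bad_path = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2)]  # crosses the wall at (1, 1)

good_ok = path_is_valid(grid, good_path, start, goal)
bad_ok = path_is_valid(grid, bad_path, start, goal)
```

A text step that accepts `bad_path` as reaching the goal would be an instance of the modal isolation described above.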
Citation
@unpublished{liXXXXbridging,
title = {{Bridging Modal Isolation in Interleaved Thinking: Supervising Modality Transitions via Stepwise Reinforcement}},
author = {Tingyu Li and Le Zhou and Siyuan Li and Yujun Wu and Xinglong Xu and Jingxuan Wei and Conghui He and Cheng Tan},
year = {XXXX}
}