Interleaved text-image reasoning

Bridging Modal Isolation in Interleaved Thinking

MoTiF supervises the transition points where text reasoning becomes visual generation and generated images become text observations. The framework combines Reflective SFT with Flow-GRPO to reduce cross-modal hallucination without relying on final-answer rewards.

Tingyu Li, Le Zhou, Siyuan Li, Yujun Wu, Xinglong Xu, Jingxuan Wei, Conghui He, Cheng Tan

Shanghai Artificial Intelligence Laboratory, Shanghai Jiao Tong University, Zhejiang University, University of Chinese Academy of Sciences

4 visual puzzle benchmarks

70.98 overall score after Stage 2

+45.61 overall gain versus BaseModel

+6.21 average R_u2g transition gain

Diagram showing modal isolation in a maze navigation task where visual generation and later text reasoning diverge. — Modal isolation: text and image steps alternate, but the world state no longer transfers faithfully across the boundary.

Failure mode

The bottleneck sits at the modality boundary

Long-chain multimodal reasoning can fail even when the model keeps producing both text and images. The generated image can drift away from the instruction, and the following text can ignore visible contradictions.

Cross-modal hallucination

The text step encodes an intended world state, but the image step renders a different state. MoTiF treats this as information loss from understanding to generation.

Visual utilization deficit

The next text step observes an image but fails to decode the evidence it actually contains. MoTiF directly trains the text side to detect and recover from this mismatch.

Atomic decomposition of interleaved thinking with text-to-image and image-to-text transition losses. — Atomic decomposition makes the two failure directions measurable: text-to-image fidelity and image-to-text utilization.

MoTiF framework

Transition-level supervision, two training paths

Each training signal is attached to a single modality transition. This keeps the optimization aligned with the structural failure instead of rewarding lucky final answers.

Measure transition loss

A rubric-based VLM judge scores whether generated images match textual instructions and whether text correctly observes generated images.

Reflective SFT

Corrupted intermediate images force the model to identify visual errors, ignore or redraw them, and continue reasoning from the intended state.

Flow-GRPO

Image generation is optimized with transition rewards so visual outputs better preserve the state specified by the preceding text instruction.
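To make "one signal per transition" concrete, here is a minimal sketch of how an interleaved chain could be split into scored transitions. The `Step` type, the `transition_rewards` helper, and `toy_judge` are hypothetical names introduced for illustration; the real framework uses a rubric-based VLM judge, which the string-matching placeholder below only stands in for.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    modality: str  # "text" or "image"
    content: str   # text, or a hypothetical image identifier


# judge(kind, source, target) -> score; a stand-in for the rubric-based VLM judge.
Judge = Callable[[str, Step, Step], float]


def transition_rewards(chain: List[Step], judge: Judge) -> List[Tuple[str, float]]:
    """Score every adjacent cross-modal pair in an interleaved chain."""
    scored = []
    for prev, nxt in zip(chain, chain[1:]):
        if prev.modality == "text" and nxt.modality == "image":
            kind = "u2g"  # understanding -> generation: does the image match the instruction?
        elif prev.modality == "image" and nxt.modality == "text":
            kind = "g2u"  # generation -> understanding: does the text read the image correctly?
        else:
            continue      # same-modality neighbours are not a transition
        scored.append((kind, judge(kind, prev, nxt)))
    return scored


def toy_judge(kind: str, source: Step, target: Step) -> float:
    # Placeholder rule so the sketch runs without a VLM: compare the state after ":".
    return 1.0 if source.content.split(":")[-1] == target.content.split(":")[-1] else 0.0


chain = [
    Step("text", "plan:move-right"),
    Step("image", "render:move-left"),   # the image drifts from the instruction
    Step("text", "observe:move-left"),   # the text reads the image faithfully
]
rewards = transition_rewards(chain, toy_judge)
```

In this toy chain the text-to-image transition scores 0.0 and the image-to-text transition scores 1.0, so the failure is attributed to the generation side rather than to the chain as a whole.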

Reflective SFT data collection pipeline with corrupted visual states and recovery text. — Reflective SFT teaches the model to handle wrong generated images instead of silently trusting them.

Inject a controlled visual error

Replace one intermediate image with a corrupted variant so the next text step has to respond to visual evidence, not only the previous plan.

Detect the mismatch

The rewritten text first observes what is actually visible, then explains why that image conflicts with the intended drawing instruction.

Recover the chain state

Training examples use detect-and-ignore or detect-and-redraw patterns, keeping the reasoning chain aligned with the correct world state.

Filter by transition fidelity

Rubric-based judges retain samples where the text correctly uses visual information and produces an executable next drawing plan.
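The four steps above can be sketched as a single data-construction function. This is an illustration under stated assumptions, not the authors' pipeline: `build_reflective_sample`, the `corrupt`, `rewrite`, and `judge` callables, and the `threshold` value are all hypothetical, and the toy callables at the bottom only exist so the sketch runs.

```python
import random
from typing import Callable, Dict, List, Optional

Step = Dict[str, str]  # {"modality": "text" | "image", "content": ...}


def build_reflective_sample(
    chain: List[Step],
    corrupt: Callable[[str], str],       # hypothetical: image -> corrupted image
    rewrite: Callable[[str, str], str],  # hypothetical: (image, instruction) -> recovery text
    judge: Callable[[str, str], float],  # hypothetical: (image, text) -> fidelity score
    threshold: float = 0.5,              # assumed cut-off, not given in the source
    seed: int = 0,
) -> Optional[List[Step]]:
    rng = random.Random(seed)
    # Candidate slots: an image that follows an instruction and precedes a text step.
    slots = [
        i for i in range(1, len(chain) - 1)
        if chain[i - 1]["modality"] == "text"
        and chain[i]["modality"] == "image"
        and chain[i + 1]["modality"] == "text"
    ]
    if not slots:
        return None
    i = rng.choice(slots)
    instruction = chain[i - 1]["content"]
    bad_image = corrupt(chain[i]["content"])    # 1. inject a controlled visual error
    recovery = rewrite(bad_image, instruction)  # 2-3. detect the mismatch, recover the state
    if judge(bad_image, recovery) < threshold:  # 4. filter by transition fidelity
        return None
    sample = [dict(step) for step in chain]     # copy so the clean chain stays intact
    sample[i]["content"] = bad_image
    sample[i + 1]["content"] = recovery
    return sample


# Toy stand-ins so the sketch executes without a renderer or a VLM.
def toy_corrupt(img: str) -> str:
    return img + "#corrupted"


def toy_rewrite(img: str, instruction: str) -> str:
    return f"Observed {img}; this conflicts with '{instruction}'. Redraw: {instruction}"


def toy_judge(img: str, text: str) -> float:
    return 1.0 if img in text else 0.0


clean_chain = [
    {"modality": "text", "content": "draw path A->B"},
    {"modality": "image", "content": "img_001"},
    {"modality": "text", "content": "path reaches B, continue to C"},
]
sample = build_reflective_sample(clean_chain, toy_corrupt, toy_rewrite, toy_judge)
```

The toy rewrite follows the detect-and-redraw pattern; a detect-and-ignore variant would restate the intended state and continue without requesting a new image.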

Why not only final rewards?

End-task rewards can accept chains that arrive at the right answer through inconsistent intermediate states. MoTiF supervises the local boundary where information is lost.

Benchmarks and data

Four tasks stress different reasoning loads

The suite spans planning, navigation, object manipulation, and physical reflection. Dataset tables are included in full for reproducibility.

Sokoban

Planning-heavy box-pushing tasks where understanding carries much of the reasoning load.

Maze

Long-chain navigation where text plans must remain consistent with rendered paths.

Manipulation

Object removal, insertion, attribute changes, and spatial updates over rendered scenes.

Ball Tracking

Reflection trajectories where visual world modeling provides critical physical intuition.

Naive interleaving SFT datasets

Table 1: Naive interleaving SFT datasets.
Task	Reference	Size
Sokoban	Game-RL	7,997
Maze	Game-RL	7,365
Multi-hop Manipulation	CLEVER	8,000
Ball Tracking	RBench-V	8,000

Detailed dataset information

Table 6: Detailed Dataset Information.
Task	Size-RL	Size-SFT
Sokoban	6,074	15,994
Maze	3,945	14,730
Manipulation	4,034	16,000
Ball Tracking	4,615	16,000

Main results

MoTiF improves open-source interleaved reasoning

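Table 2 covers frontier models, open-source baselines, and both MoTiF stages. Its last row gives the gain of Stage 2 over the base model; the per-task deltas equal the Stage 2 scores minus the Bagel-7B-MoT scores, which suggests Bagel-7B-MoT is the base model.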

Model performance on visual puzzle benchmarks

Table 2: Model performance on our benchmarks. Blue marks the best value and green marks the second-best value within the relevant comparison.
Model	Sokoban	Maze	Manipulation	Ball Tracking	Overall
Frontier Models
Gemini3.5-Flash	85.71	97.47	93.33	77.33	88.46
Gemini3.1-Flash-Lite	37.30	11.39	88.00	40.00	44.17
Seed2.0-Lite-0215	69.05	64.56	89.33	58.67	70.40
Open-Source Models
Qwen3.5-27B	80.95	16.46	86.67	32.00	54.02
Gemma-4-31B	66.67	13.92	90.67	32.00	50.82
Bagel-7B-MoT	16.67	10.13	45.33	29.33	25.37
ThinkMorph	18.25	8.86	69.33	48.00	36.11
w/o interleaved thinking	22.22	5.06	65.33	37.33	31.74
Our Models
Ours (Stage 1: optimize L_gen-to-und)	43.65	67.09	87.67	70.33	67.19
Ours (Stage 2: optimize L_und-to-gen)	50.00	70.57	90.67	72.67	70.98
Delta max (vs. BaseModel)	+33.33	+60.44	+45.34	+43.34	+45.61
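As a sanity check on Table 2, the short script below recomputes the Overall column and the Delta max row for three rows. It rests on two assumptions that are inferred from the numbers rather than stated in the source: Overall is the unweighted mean of the four task scores, and BaseModel refers to Bagel-7B-MoT.

```python
TASKS = ["Sokoban", "Maze", "Manipulation", "Ball Tracking"]

# Per-task scores copied from Table 2.
scores = {
    "Bagel-7B-MoT": [16.67, 10.13, 45.33, 29.33],
    "Ours (Stage 1)": [43.65, 67.09, 87.67, 70.33],
    "Ours (Stage 2)": [50.00, 70.57, 90.67, 72.67],
}
# Overall values as reported in Table 2.
reported_overall = {
    "Bagel-7B-MoT": 25.37,
    "Ours (Stage 1)": 67.19,
    "Ours (Stage 2)": 70.98,
}

# Assumption: Overall is the unweighted mean over the four tasks.
overall = {name: sum(vals) / len(vals) for name, vals in scores.items()}

# Assumption: the base model is Bagel-7B-MoT, so the delta is Stage 2 minus that row.
delta = [
    round(stage2 - base, 2)
    for stage2, base in zip(scores["Ours (Stage 2)"], scores["Bagel-7B-MoT"])
]
delta_overall = round(
    reported_overall["Ours (Stage 2)"] - reported_overall["Bagel-7B-MoT"], 2
)

for task, d in zip(TASKS, delta):
    print(f"{task}: +{d:.2f}")
print(f"Overall: +{delta_overall:.2f}")
```

Under these assumptions the recomputed means agree with the reported Overall values to within rounding, and the deltas reproduce the last row of the table.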

Line plots showing model accuracy and cross-modal rewards across tasks during training. — Accuracy and transition reward curves evolve together, indicating that improving modality fidelity supports final task performance.

Training analysis

Flow-GRPO improves generation fidelity with a bounded tradeoff

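Table 5 compares the transition rewards R_g2u and R_u2g before and after Flow-GRPO: average R_u2g rises from 64.66 to 70.87, while average R_g2u slips from 65.53 to 64.47. Tables 3 and 4 list the training hyperparameters for Reflective SFT and Flow-GRPO.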

Bar chart comparing R_g2u and R_u2g before and after Flow-GRPO training. — Flow-GRPO raises understanding-to-generation fidelity across all tasks while keeping generation-to-understanding degradation bounded.

Before and after Flow-GRPO

Table 5: Comparison before and after Flow-GRPO.
Task	R_g2u (Pre)	R_g2u (Post)	R_u2g (Pre)	R_u2g (Post)
Sokoban	52.80	51.20	56.00	60.80
Maze	64.00	65.33	80.00	88.00
Manipulation	82.67	81.33	49.33	58.67
Ball	62.67	60.00	73.33	76.00
Average	65.53	64.47	64.66	70.87
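The size of the tradeoff can be read off Table 5 by differencing the Pre and Post columns. The sketch below does exactly that; it uses only the values in the table, and the `changes` helper is a name introduced here for illustration.

```python
# (pre, post) pairs copied from Table 5.
table5 = {
    "Sokoban":      {"g2u": (52.80, 51.20), "u2g": (56.00, 60.80)},
    "Maze":         {"g2u": (64.00, 65.33), "u2g": (80.00, 88.00)},
    "Manipulation": {"g2u": (82.67, 81.33), "u2g": (49.33, 58.67)},
    "Ball":         {"g2u": (62.67, 60.00), "u2g": (73.33, 76.00)},
    "Average":      {"g2u": (65.53, 64.47), "u2g": (64.66, 70.87)},
}


def changes(key: str) -> dict:
    """Post minus Pre for one reward, per row, rounded to two decimals."""
    return {row: round(vals[key][1] - vals[key][0], 2) for row, vals in table5.items()}


u2g_change = changes("u2g")
g2u_change = changes("g2u")

# Exclude the Average row when looking for per-task extremes.
tasks = [row for row in table5 if row != "Average"]
largest_u2g_gain = max(tasks, key=lambda t: u2g_change[t])
largest_g2u_drop = min(tasks, key=lambda t: g2u_change[t])

print("R_u2g change:", u2g_change)
print("R_g2u change:", g2u_change)
```

The Average row gives +6.21 for R_u2g and -1.06 for R_g2u. Per task, R_u2g improves everywhere, with the largest gain on Manipulation, while the largest R_g2u drop is 2.67 points on Ball and Maze improves slightly.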

Reflective SFT hyperparameters

Table 3: Reflective SFT experiment hyperparameters.
Parameter	Value
expected_num_tokens	34560
max_num_tokens	34560
max_num_tokens_per_sample	17280
prefer_buffer_before	17280
total_steps	3020
warmup_steps	150
save_every	300
lr_scheduler	cosine
mse_weight	10
ce_weight	1
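For readers unfamiliar with these settings, the sketch below shows one common reading of the schedule and loss weights in Table 3. Several points are assumptions: linear warmup followed by cosine decay to `min_lr`, a `peak_lr` that Table 3 does not list, and a loss that linearly combines a cross-entropy term (presumably text tokens) with an MSE term (presumably image generation). The source does not spell out any of these.

```python
import math

WARMUP_STEPS = 150   # from Table 3
TOTAL_STEPS = 3020   # from Table 3
CE_WEIGHT = 1.0      # from Table 3
MSE_WEIGHT = 10.0    # from Table 3


def lr_at(step: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay (assumed shape)."""
    if step < WARMUP_STEPS:
        return peak_lr * step / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def total_loss(ce_loss: float, mse_loss: float) -> float:
    """Weighted sum of the two loss terms (assumed linear combination)."""
    return CE_WEIGHT * ce_loss + MSE_WEIGHT * mse_loss


# peak_lr below is a hypothetical value chosen only to exercise the function.
schedule = [lr_at(s, peak_lr=1e-4) for s in (0, 75, 150, 1585, 3020)]
```

With these weights, a unit of MSE loss contributes ten times as much to the objective as a unit of cross-entropy loss.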

Flow-GRPO hyperparameters

Table 4: Flow-GRPO experiment hyperparameters.
Parameter	Value
judge_model	qwen3.5_27b
train_batch_size	6
num_image_per_prompt	16
num_batches_per_epoch	1
learning_rate	5e-6
beta	0.04
sde_window_size	3
seed	42
noise_level	1.3
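The `num_image_per_prompt` setting implies that 16 images are sampled per prompt and compared against each other. A minimal sketch of the group-relative advantage used in GRPO-style training follows, assuming the standard mean and standard-deviation normalization within each group. The `eps` value and the example rewards are assumptions, and `beta` (commonly a KL-penalty coefficient in such objectives, though the source does not define it) is left out.

```python
from statistics import mean, pstdev
from typing import List

GROUP_SIZE = 16  # num_image_per_prompt from Table 4


def group_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Normalize judge rewards within one prompt's group of sampled images."""
    if len(rewards) != GROUP_SIZE:
        raise ValueError(f"expected {GROUP_SIZE} rewards, got {len(rewards)}")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps keeps the division finite when every image receives the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical transition rewards: 4 of 16 images match the text instruction.
rewards = [1.0] * 4 + [0.0] * 12
advantages = group_advantages(rewards)
```

Images that match the instruction receive positive advantages and the rest negative ones. A group in which every image scores the same yields all-zero advantages and therefore no learning signal for that prompt.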

Qualitative traces

Intermediate images expose when a chain stays grounded

The examples emphasize why transition fidelity matters: the visual state must be both rendered faithfully and read back into the next text step.

First intermediate maze image in an interleaved reasoning chain. — Maze step 1: a path state that the next text segment must inspect rather than assume.

What the image tests

The maze state checks whether the chain preserves path geometry, walls, and start-to-goal constraints after an intermediate drawing step.

Where isolation appears

Modal isolation occurs when the next text step validates a visually invalid path instead of reading the evidence from the generated image.

What MoTiF should preserve

A grounded chain must carry the same task-relevant world state across text, image, and the following observation before advancing.

Citation

BibTeX

@unpublished{liXXXXbridging,
  title = {{Bridging Modal Isolation in Interleaved Thinking: Supervising Modality Transitions via Stepwise Reinforcement}},
  author = {Tingyu Li and Le Zhou and Siyuan Li and Yujun Wu and Xinglong Xu and Jingxuan Wei and Conghui He and Cheng Tan},
  year = {XXXX}
}