Interleaved text-image reasoning

Bridging Modal Isolation in Interleaved Thinking

MoTiF supervises the transition points where text reasoning becomes visual generation and generated images become text observations. The framework combines Reflective SFT with Flow-GRPO to reduce cross-modal hallucination without relying on final-answer rewards.

Tingyu Li, Le Zhou, Siyuan Li, Yujun Wu, Xinglong Xu, Jingxuan Wei, Conghui He, Cheng Tan

Shanghai Artificial Intelligence Laboratory, Shanghai Jiao Tong University, Zhejiang University, University of Chinese Academy of Sciences

4 visual puzzle benchmarks
70.98 overall score after Stage 2
+45.61 overall gain versus BaseModel
+6.21 average R_u2g transition gain
Modal isolation: text and image steps alternate, but the world state no longer transfers faithfully across the boundary.

Failure mode

The bottleneck sits at the modality boundary

Long-chain multimodal reasoning can fail even when the model keeps producing both text and images. The generated image can drift away from the instruction, and the following text can ignore visible contradictions.

Cross-modal hallucination

The text step encodes an intended world state, but the image step renders a different state. MoTiF treats this as information loss from understanding to generation.

Visual utilization deficit

The next text step observes an image but fails to decode the evidence it actually contains. MoTiF directly trains the text side to detect and recover from this mismatch.

Atomic decomposition makes the two failure directions measurable: text-to-image fidelity and image-to-text utilization.
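
One way to make the two directions concrete is to tag every judged transition with its direction and average the judge's verdicts per direction. The sketch below is illustrative only: the `Transition` record, the binary `passed` verdict, and the `fidelity_by_direction` helper are assumptions, since the actual rubric and score scale are not given here.

```python
from dataclasses import dataclass
from typing import Literal

# u2g: text -> image (understanding to generation)
# g2u: image -> text (generation to understanding)
Direction = Literal["u2g", "g2u"]


@dataclass(frozen=True)
class Transition:
    direction: Direction
    passed: bool  # hypothetical binary verdict from a rubric-based judge


def fidelity_by_direction(transitions: list[Transition]) -> dict[str, float]:
    """Return the percentage of passed transitions for each direction."""
    rates: dict[str, float] = {}
    for direction in ("u2g", "g2u"):
        verdicts = [t.passed for t in transitions if t.direction == direction]
        if verdicts:  # skip directions with no judged transitions
            rates[direction] = 100.0 * sum(verdicts) / len(verdicts)
    return rates


# A toy chain: both drawings match their instructions,
# but one of the two follow-up observations misreads its image.
chain = [
    Transition("u2g", True),
    Transition("g2u", False),
    Transition("u2g", True),
    Transition("g2u", True),
]
rates = fidelity_by_direction(chain)
print(rates)  # {'u2g': 100.0, 'g2u': 50.0}
```

Keeping the two rates separate shows which side of the boundary loses information, which a single chain-level score would hide.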

MoTiF framework

Transition-level supervision, two training paths

Each training signal is attached to a single modality transition. This keeps the optimization aligned with the structural failure instead of rewarding lucky final answers.

1. Measure transition loss

A rubric-based VLM judge scores whether generated images match textual instructions and whether text correctly observes generated images.

2. Reflective SFT

Corrupted intermediate images force the model to identify visual errors, ignore or redraw them, and continue reasoning from the intended state.

3. Flow-GRPO

Image generation is optimized with transition rewards so visual outputs better preserve the state specified by the preceding text instruction.

Reflective SFT teaches the model to handle wrong generated images instead of silently trusting them.

Inject a controlled visual error

Replace one intermediate image with a corrupted variant so the next text step has to respond to visual evidence, not only the previous plan.

Detect the mismatch

The rewritten text first observes what is actually visible, then explains why that image conflicts with the intended drawing instruction.

Recover the chain state

Training examples use detect-and-ignore or detect-and-redraw patterns, keeping the reasoning chain aligned with the correct world state.

Filter by transition fidelity

Rubric-based judges retain samples where the text correctly uses visual information and produces an executable next drawing plan.
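
The four steps above can be sketched as a small data-construction routine. Everything here is a hypothetical stand-in: the `Step` record, the `corrupt` callable, the wording of the reflection text, and the `threshold` in `filter_by_fidelity` are assumptions for illustration, not the pipeline used in the paper.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Step:
    instruction: str  # text that states what the next image should show
    image: str        # stand-in for the rendered image (an ID in this sketch)


def make_reflective_chain(chain, corrupt, rng, mode="redraw"):
    """Corrupt one intermediate image and add text that recovers from it."""
    if len(chain) < 2:
        raise ValueError("need at least one intermediate step before the final one")
    if mode not in ("redraw", "ignore"):
        raise ValueError(f"unknown recovery mode: {mode}")
    idx = rng.randrange(len(chain) - 1)  # never corrupt the final step
    intended = chain[idx]
    corrupted = replace(intended, image=corrupt(intended.image))
    reflection = (
        f"Observed image {corrupted.image} conflicts with the instruction "
        f"'{intended.instruction}'."
    )
    out = list(chain[:idx]) + [corrupted]
    if mode == "redraw":
        # Detect-and-redraw: a new step regenerates the intended image.
        out.append(Step(reflection + " Redraw it.", intended.image))
        out.extend(chain[idx + 1:])
    else:
        # Detect-and-ignore: the next step notes the error and moves on.
        nxt = chain[idx + 1]
        out.append(replace(nxt, instruction=reflection + " Ignore it. " + nxt.instruction))
        out.extend(chain[idx + 2:])
    return idx, out


def filter_by_fidelity(samples, judge, threshold=0.5):
    """Keep samples whose judge score reaches the (hypothetical) threshold."""
    return [s for s in samples if judge(s) >= threshold]


chain = [
    Step("draw the agent at (1, 1)", "img_0"),
    Step("move the agent right", "img_1"),
    Step("push the box onto the goal", "img_2"),
]
idx, reflective = make_reflective_chain(
    chain, lambda img: img + "_corrupted", random.Random(0)
)
```

In both recovery modes the final step is left untouched, so the chain still ends in the intended world state.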

Why not only final rewards?

End-task rewards can accept chains that arrive at the right answer through inconsistent intermediate states. MoTiF supervises the local boundary where information is lost.

Benchmarks and data

Four tasks stress different reasoning loads

The suite spans planning, navigation, object manipulation, and physical reflection. Dataset tables are included in full for reproducibility.

Sokoban grid example used for visual planning evaluation.

Sokoban

Planning-heavy box-pushing tasks where understanding carries much of the reasoning load.

Maze navigation example with start, walls, and goal state.

Maze

Long-chain navigation where text plans must remain consistent with rendered paths.

Multi-hop manipulation scene thumbnail with colored objects for spatial operations.

Manipulation

Object removal, insertion, attribute changes, and spatial updates over rendered scenes.

Ball tracking example involving specular reflection reasoning.

Ball Tracking

Reflection trajectories where visual world modeling provides critical physical intuition.

Table 1: Naive interleaving SFT datasets.

| Task | Reference | Size |
| --- | --- | --- |
| Sokoban | Game-RL | 7,997 |
| Maze | Game-RL | 7,365 |
| Multi-hop Manipulation | CLEVER | 8,000 |
| Ball Tracking | RBench-V | 8,000 |

Table 6: Detailed Dataset Information.

| Task | Size-RL | Size-SFT |
| --- | --- | --- |
| Sokoban | 6,074 | 15,994 |
| Maze | 3,945 | 14,730 |
| Manipulation | 4,034 | 16,000 |
| Ball Tracking | 4,615 | 16,000 |

Main results

MoTiF improves open-source interleaved reasoning

The complete benchmark table is preserved, including frontier models, open-source baselines, both MoTiF stages, and the maximum gain over the base model.

Table 2: Model performance on our benchmarks. Blue marks the best value and green marks the second-best value within the relevant comparison.

| Model | Sokoban | Maze | Manipulation | Ball Tracking | Overall |
| --- | --- | --- | --- | --- | --- |
| Frontier Models | | | | | |
| Gemini3.5-Flash | 85.71 | 97.47 | 93.33 | 77.33 | 88.46 |
| Gemini3.1-Flash-Lite | 37.30 | 11.39 | 88.00 | 40.00 | 44.17 |
| Seed2.0-Lite-0215 | 69.05 | 64.56 | 89.33 | 58.67 | 70.40 |
| Open-Source Models | | | | | |
| Qwen3.5-27B | 80.95 | 16.46 | 86.67 | 32.00 | 54.02 |
| Gemma-4-31B | 66.67 | 13.92 | 90.67 | 32.00 | 50.82 |
| Bagel-7B-MoT | 16.67 | 10.13 | 45.33 | 29.33 | 25.37 |
| ThinkMorph | 18.25 | 8.86 | 69.33 | 48.00 | 36.11 |
| w/o interleaved thinking | 22.22 | 5.06 | 65.33 | 37.33 | 31.74 |
| Our Models | | | | | |
| Ours (Stage 1: optimize L_gen-to-und) | 43.65 | 67.09 | 87.67 | 70.33 | 67.19 |
| Ours (Stage 2: optimize L_und-to-gen) | 50.00 | 70.57 | 90.67 | 72.67 | 70.98 |
| Delta max (vs. BaseModel) | +33.33 | +60.44 | +45.34 | +43.34 | +45.61 |

Accuracy and transition reward curves evolve together, indicating that improving modality fidelity supports final task performance.
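
The "Delta max" row of Table 2 can be reproduced with a few lines of arithmetic. The sketch assumes that BaseModel is the Bagel-7B-MoT row. The page does not say so explicitly, but the reported deltas match that row in every column. Variable names are illustrative.

```python
TASKS = ("Sokoban", "Maze", "Manipulation", "Ball Tracking", "Overall")

# Rows copied from Table 2.
base = (16.67, 10.13, 45.33, 29.33, 25.37)    # Bagel-7B-MoT (assumed BaseModel)
stage1 = (43.65, 67.09, 87.67, 70.33, 67.19)  # Ours, Stage 1
stage2 = (50.00, 70.57, 90.67, 72.67, 70.98)  # Ours, Stage 2

# Best of the two stages minus the base score, per column.
delta_max = {
    task: round(max(s1, s2) - b, 2)
    for task, b, s1, s2 in zip(TASKS, base, stage1, stage2)
}

for task, delta in delta_max.items():
    print(f"{task}: +{delta:.2f}")
```

Stage 2 is the higher of the two stages in every column, so the row equals the Stage 2 score minus the base score.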

Training analysis

Flow-GRPO improves generation fidelity with bounded tradeoff

The paper reports the transition rewards before and after Flow-GRPO, then lists training hyperparameters for both stages.

Flow-GRPO raises understanding-to-generation fidelity across all tasks while keeping generation-to-understanding degradation bounded.

Table 5: Comparison before and after Flow-GRPO.

| Task | R_g2u (Pre) | R_g2u (Post) | R_u2g (Pre) | R_u2g (Post) |
| --- | --- | --- | --- | --- |
| Sokoban | 52.80 | 51.20 | 56.00 | 60.80 |
| Maze | 64.00 | 65.33 | 80.00 | 88.00 |
| Manipulation | 82.67 | 81.33 | 49.33 | 58.67 |
| Ball Tracking | 62.67 | 60.00 | 73.33 | 76.00 |
| Average | 65.53 | 64.47 | 64.66 | 70.87 |

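
The size of the tradeoff in Table 5 is easier to see as per-task differences. The snippet below only restates the table's numbers. The dictionary layout and names are illustrative.

```python
# (pre, post) pairs copied from Table 5.
TABLE5 = {
    "Sokoban":       {"g2u": (52.80, 51.20), "u2g": (56.00, 60.80)},
    "Maze":          {"g2u": (64.00, 65.33), "u2g": (80.00, 88.00)},
    "Manipulation":  {"g2u": (82.67, 81.33), "u2g": (49.33, 58.67)},
    "Ball Tracking": {"g2u": (62.67, 60.00), "u2g": (73.33, 76.00)},
}
AVERAGE = {"g2u": (65.53, 64.47), "u2g": (64.66, 70.87)}  # reported Average row


def gains(metric):
    """Post minus pre for one reward, per task."""
    return {
        task: round(row[metric][1] - row[metric][0], 2)
        for task, row in TABLE5.items()
    }


u2g_gains = gains("u2g")
g2u_gains = gains("g2u")
avg_gain = {m: round(post - pre, 2) for m, (pre, post) in AVERAGE.items()}

print(u2g_gains)  # every task improves
print(g2u_gains)  # small changes in both directions
print(avg_gain)   # {'g2u': -1.06, 'u2g': 6.21}
```

R_u2g rises on all four tasks, while the largest R_g2u drop is 2.67 points (Ball Tracking). Averaging the per-task differences directly gives about +6.20 rather than +6.21, because the Average row is rounded before subtraction.
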
Table 3: Reflective SFT experiment hyperparameters.

| Parameter | Value |
| --- | --- |
| expected_num_tokens | 34560 |
| max_num_tokens | 34560 |
| max_num_tokens_per_sample | 17280 |
| prefer_buffer_before | 17280 |
| total_steps | 3020 |
| warmup_steps | 150 |
| save_every | 300 |
| lr_scheduler | cosine |
| mse_weight | 10 |
| ce_weight | 1 |

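
To illustrate how the schedule and loss weights in Table 3 might interact, here is a minimal sketch. It assumes linear warmup followed by cosine decay to zero, and a linear combination of a cross-entropy term (text tokens) and an MSE term (image generation). Neither detail is confirmed by the table, and `lr_scale` and `combined_loss` are hypothetical helper names.

```python
import math

TOTAL_STEPS = 3020   # total_steps
WARMUP_STEPS = 150   # warmup_steps
CE_WEIGHT = 1.0      # ce_weight
MSE_WEIGHT = 10.0    # mse_weight


def lr_scale(step, warmup_steps=WARMUP_STEPS, total_steps=TOTAL_STEPS):
    """Multiplier on the peak learning rate (assumed schedule shape)."""
    if step < warmup_steps:
        return step / warmup_steps  # linear warmup from 0 to 1
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))


def combined_loss(ce, mse, ce_weight=CE_WEIGHT, mse_weight=MSE_WEIGHT):
    """Weighted sum of the two loss terms (assumed to combine linearly)."""
    return ce_weight * ce + mse_weight * mse


for step in (0, 75, 150, 1585, 3020):
    print(step, round(lr_scale(step), 4))
print(combined_loss(ce=2.0, mse=0.05))
```

Under these assumptions the MSE term is weighted ten times more heavily than the cross-entropy term, so an MSE of 0.05 contributes 0.5 to the total.
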
Table 4: Flow-GRPO experiment hyperparameters.

| Parameter | Value |
| --- | --- |
| judge_model | qwen3.5_27b |
| train_batch_size | 6 |
| num_image_per_prompt | 16 |
| num_batches_per_epoch | 1 |
| learning_rate | 5e-6 |
| beta | 0.04 |
| sde_window_size | 3 |
| seed | 42 |
| noise_level | 1.3 |
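
GRPO-style methods score a group of samples for the same prompt and normalize each reward against the group's mean and standard deviation. With `num_image_per_prompt` set to 16, each group would hold 16 judged images. The sketch below shows that generic normalization only. It is not taken from the paper's code, and the function name, `eps` value, and toy rewards are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled images."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Toy transition rewards for 16 images generated from one text instruction:
# 4 images match the instruction, 12 do not.
rewards = [1.0] * 4 + [0.0] * 12
advantages = group_relative_advantages(rewards)

print(round(advantages[0], 3), round(advantages[-1], 3))
```

Images that match the instruction receive positive advantages and the rest negative ones. A group with identical rewards yields all-zero advantages and therefore no learning signal for that prompt.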

Qualitative traces

Intermediate images expose when a chain stays grounded

The examples emphasize why transition fidelity matters: the visual state must be both rendered faithfully and read back into the next text step.

Maze step 1: a path state that the next text segment must inspect rather than assume.

What the image tests

The maze state checks whether the chain preserves path geometry, walls, and start-to-goal constraints after an intermediate drawing step.

Where isolation appears

Modal isolation occurs when the next text step validates a visually invalid path instead of reading the evidence from the generated image.

What MoTiF should preserve

A grounded chain must carry the same task-relevant world state across text, image, and the following observation before advancing.

Citation

BibTeX
@unpublished{liXXXXbridging,
  title = {{Bridging Modal Isolation in Interleaved Thinking: Supervising Modality Transitions via Stepwise Reinforcement}},
  author = {Tingyu Li and Le Zhou and Siyuan Li and Yujun Wu and Xinglong Xu and Jingxuan Wei and Conghui He and Cheng Tan},
  year = {XXXX}
}