Precision-Aware Geometric Reasoning
PAGER turns geometric reasoning into pixel-level GUI execution. PAGE Bench measures this precision-sensitive setting with verified GeoGebra trajectories, process supervision, and geometry-aware evaluation.
Motivation
Conventional GUI tasks tolerate nearby clicks inside the same component. Geometric construction does not: a misplaced point changes every dependent line, circle, angle, intersection, label, and polygon.
Success is measured by distance to a target point, not membership in a tolerant UI region.
Early coordinate drift cascades through downstream geometric objects and relations.
Drawing requires exact endpoints, radii, text anchors, object styles, and canvas coordinates.
Models can choose the right operation type while failing to execute a valid construction.
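The contrast between region-tolerant and point-level success can be sketched in a few lines. This is a minimal illustration, not the benchmark's evaluator: the function names, the 5-pixel tolerance, and the example coordinates are all assumptions chosen for the example.

```python
import math

def hits_region(click, box):
    """Region-tolerant check: any click inside the component's box counts."""
    x, y = click
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom

def hits_point(click, target, tol_px=5.0):
    """Point-level check: the click must land within tol_px of the target.

    tol_px is an assumed tolerance, not a value taken from PAGE Bench.
    """
    return math.dist(click, target) <= tol_px

# Hypothetical values: a 100x40 widget and a target point at its center.
button = (200, 300, 300, 340)
target = (250, 320)
click = (262, 329)  # offset of (12, 9) pixels from the target

region_ok = hits_region(click, button)  # the click is inside the widget
point_ok = hits_point(click, target)    # the click is too far from the point
```

The same click passes the region test and fails the point test, which is the gap that separates ordinary GUI grounding from geometric construction.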
PAGER Framework
PAGER factorizes drawing into dependency-structured planning and state-conditioned GUI execution, then trains with pixel-grounded SFT and precision-aligned reinforcement learning.
Dependency-structured planning induces a construction graph and orders primitives so that prerequisite objects exist before any operation that depends on them.
State-conditioned GUI execution grounds each sub-task into click, paint, and type actions, conditioned on the current canvas state and the action history.
Precision-aligned reinforcement learning combines action-type correctness, parameter accuracy, and rendered geometric validity to reduce rollout drift.
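Two of these ideas lend themselves to a short sketch: ordering primitives by dependency, and blending the three reward signals. The code below is an illustration under stated assumptions, not PAGER's implementation. The graph contents, the function names `plan_order` and `step_reward`, the weights, and the 10-pixel error scale are all hypothetical; the ordering uses a standard topological sort, which is one plausible way to respect prerequisites.

```python
from graphlib import TopologicalSorter

# Hypothetical construction graph for a rectangle with diagonals:
# each object maps to the set of objects it depends on.
construction = {
    "A": set(), "B": set(), "C": set(), "D": set(),
    "AC": {"A", "C"},      # diagonal through A and C
    "BD": {"B", "D"},      # diagonal through B and D
    "O": {"AC", "BD"},     # intersection of the two diagonals
}

def plan_order(graph):
    """Return an order in which every object follows its prerequisites."""
    return list(TopologicalSorter(graph).static_order())

def step_reward(type_ok, param_err_px, geometry_ok,
                scale_px=10.0, weights=(0.2, 0.4, 0.4)):
    """Illustrative reward in [0, 1]; weights and scale_px are assumptions.

    Blends action-type correctness, a parameter score that decays linearly
    with pixel error, and a flag for rendered geometric validity.
    """
    param_score = max(0.0, 1.0 - param_err_px / scale_px)
    w_type, w_param, w_geo = weights
    return (w_type * float(type_ok)
            + w_param * param_score
            + w_geo * float(geometry_ok))

order = plan_order(construction)
```

With a reward of this shape, a rollout that picks the right tool but misses the coordinate earns only the action-type share, so correct operation choice alone is not enough.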
PAGE Bench
PAGE Bench converts K-12 geometry problems into executable GeoGebra task sequences, replays them in a live GUI, records pixel-level process traces, and filters invalid constructions after verification.
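One way to picture a recorded process trace and the post-verification filter is the sketch below. The source does not specify the record layout, so the `Action` and `Trace` classes, their fields, and the `keep_verified` helper are hypothetical; only the three action kinds (click, paint, type) come from the description of PAGER.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Action:
    kind: str                   # "click", "paint", or "type"
    x: Optional[int] = None     # pixel coordinates for click/paint actions
    y: Optional[int] = None
    text: Optional[str] = None  # payload for type actions

@dataclass
class Trace:
    problem_id: str
    actions: List[Action] = field(default_factory=list)
    verified: bool = False      # assumed flag set by a replay-and-verify stage

def keep_verified(traces):
    """Drop traces that failed verification or recorded no actions."""
    return [t for t in traces if t.verified and t.actions]

# Hypothetical traces: only the first one survives the filter.
traces = [
    Trace("p1", [Action("click", 120, 240), Action("type", text="A")], verified=True),
    Trace("p2", [Action("paint", 300, 310)], verified=False),
    Trace("p3", [], verified=True),
]
kept = keep_verified(traces)
```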
Corpus scale and process scale from the PAGE Bench source table.
| Statistic | Count | Share | Process | Count |
|---|---|---|---|---|
| Total problems | 4,906 | 100.00% | Total tasks | 53,277 |
| Train / Test | 4,443 / 463 | 90.56% / 9.44% | Avg. tasks / problem | 10.86 |
| Open-ended | 2,857 | 58.23% | Total actions | 224,497 |
| Grade 8-10+ | 4,387 | 89.42% | Click + paint | 197,634 |
| Intermediate / Hard | 2,940 / 1,677 | 94.11% (combined) | Type actions | 26,863 |
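The shares and averages in the table follow directly from the counts. The short check below recomputes them; the variable names are ours, and the numbers are copied from the table.

```python
total_problems = 4906

counts = {
    "train": 4443,
    "test": 463,
    "open_ended": 2857,
    "grade_8_10_plus": 4387,
    "intermediate_plus_hard": 2940 + 1677,
}

# Share of all problems, in percent, rounded as in the table.
shares = {name: round(100 * n / total_problems, 2) for name, n in counts.items()}

# Process-level statistics.
avg_tasks_per_problem = round(53277 / total_problems, 2)
total_actions = 197634 + 26863  # click + paint, plus type actions
```

Note that the Intermediate / Hard share is the sum of both difficulty levels, and that the train and test counts add up to the full corpus.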
Results
Strong VLMs often achieve high action-type accuracy, but task success remains low. PAGER narrows this gap by improving parameter control and preserving dependent geometric structure across the rollout.
Representative models from the main-results table.
Claude-Sonnet-4.6 reaches 95.85 Action Accuracy and Gemini-3.1-Pro reaches 66.66 Step Success, yet their full Task Success stays at 1.11 and 5.82, respectively. PAGER reaches 23.78 Task Success and 29.52 Overall, indicating greater stability across full construction trajectories.
Full table from the paper. Metrics cover action semantics, parameter precision, step/task success, compositional process scores, visual/geometric final quality, and overall performance.
| Model | Action | Param | Step | Task | O-Comp. | S-Comp. | S-Vis. | S-Geo. | Middle | Final | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Open-source VLMs | |||||||||||
| Qwen3-VL-8B | 52.03 | 9.48 | 9.38 | 0.25 | 3.09 | 3.83 | 3.69 | 3.05 | 8.18 | 3.42 | 5.80 |
| DeepSeek-VL2 | 26.33 | 1.75 | 1.53 | 0.00 | 18.25 | 8.46 | 3.70 | 12.40 | 3.11 | 11.23 | 7.17 |
| GLM-4.5V | 78.91 | 12.17 | 11.76 | 0.00 | 5.44 | 7.93 | 2.89 | 13.41 | 11.46 | 7.27 | 9.37 |
| InternVL2.5-8B | 40.34 | 1.48 | 1.33 | 0.00 | 9.23 | 7.19 | 1.72 | 10.85 | 4.45 | 7.44 | 5.94 |
| KimiVL-A3B | 46.60 | 0.59 | 0.57 | 0.00 | 1.70 | 7.12 | 1.24 | 14.05 | 4.83 | 5.70 | 5.27 |
| MiniCPM-V-2.6-8B | 45.87 | 1.06 | 0.71 | 0.00 | 16.01 | 4.81 | 1.17 | 8.18 | 4.84 | 8.12 | 6.48 |
| LLaVA-NeXT-8B | 45.77 | 1.69 | 1.59 | 0.00 | 8.04 | 5.13 | 1.74 | 8.75 | 5.06 | 6.05 | 5.56 |
| Closed-source VLMs | |||||||||||
| Claude-Sonnet-4.6 | 95.85 | 36.44 | 36.38 | 1.11 | 7.04 | 9.46 | 3.11 | 15.39 | 21.17 | 8.65 | 14.91 |
| GPT-5.4 | 88.04 | 31.71 | 31.41 | 0.56 | 10.99 | 9.59 | 3.53 | 15.39 | 18.59 | 9.96 | 14.28 |
| Qwen3.6-Plus | 90.95 | 51.60 | 51.07 | 4.90 | 18.19 | 11.04 | 4.23 | 10.52 | 27.41 | 11.72 | 19.56 |
| Gemini-3.1-Pro | 89.18 | 66.68 | 66.66 | 5.82 | 20.97 | 14.17 | 7.63 | 21.21 | 32.41 | 16.31 | 24.36 |
| Specialized GUI Agents | |||||||||||
| UI-TARS | 47.08 | 8.79 | 8.38 | 0.00 | 7.06 | 5.26 | 3.78 | 5.21 | 7.26 | 5.49 | 6.38 |
| OS-ATLAS | 51.24 | 9.19 | 8.80 | 0.29 | 14.56 | 6.65 | 4.32 | 6.36 | 7.98 | 8.50 | 8.24 |
| InfiGUI-R1-3B | 44.81 | 16.51 | 16.18 | 0.00 | 18.23 | 9.66 | 4.14 | 13.82 | 9.37 | 11.96 | 10.66 |
| OpenCUA-7B | 55.86 | 10.57 | 9.85 | 0.12 | 15.39 | 8.92 | 3.11 | 10.77 | 8.69 | 10.07 | 9.38 |
| GUI-Actor-7B | 47.26 | 5.31 | 5.02 | 0.00 | 18.66 | 6.04 | 1.42 | 7.58 | 6.26 | 9.21 | 7.74 |
| PAGER | 82.62 | 62.76 | 62.20 | 23.78 | 28.88 | 15.30 | 7.05 | 15.63 | 41.25 | 17.79 | 29.52 |
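In the reported rows, the Overall column agrees with the plain average of Middle and Final to within rounding. This is an observation about the numbers in the table, not a definition stated in the source, and the check below is only a sketch over a handful of rows.

```python
# (Middle, Final, Overall) copied from the table for a few models.
rows = {
    "Qwen3-VL-8B": (8.18, 3.42, 5.80),
    "Claude-Sonnet-4.6": (21.17, 8.65, 14.91),
    "Gemini-3.1-Pro": (32.41, 16.31, 24.36),
    "OS-ATLAS": (7.98, 8.50, 8.24),
    "PAGER": (41.25, 17.79, 29.52),
}

# Absolute gap between the reported Overall and the mean of Middle and Final.
gaps = {
    model: abs((middle + final) / 2 - overall)
    for model, (middle, final, overall) in rows.items()
}
max_gap = max(gaps.values())
```

If that reading is right, PAGER's lead in Overall comes from both halves: it has the highest Middle score and the highest Final score in the table.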
Detailed Analysis
The main result table establishes the headline comparison. These supplementary views show whether the automated verifier agrees with expert judgment and where models fail across fine-grained geometric skills.
Human ratings track the automated evaluation closely, with a correlation of r=0.9397. Most existing MLLMs cluster in the lower-left region, while PAGER appears in the top-right corner, showing that higher verifier scores correspond to human-perceived geometric correctness.
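For readers who want to reproduce this kind of agreement check, the sketch below computes a correlation coefficient from paired scores. It assumes that r denotes the Pearson coefficient, which the source does not state explicitly, and the score lists are made-up placeholders rather than data from the paper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (norm_x * norm_y)

# Hypothetical per-model scores: automated verifier vs. mean human rating.
verifier_scores = [10.0, 20.0, 30.0, 40.0, 50.0]
human_ratings = [1.2, 1.9, 3.2, 3.8, 5.1]

r = pearson_r(verifier_scores, human_ratings)
```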
The ten-skill heatmap separates easier tool-mapping and object-construction abilities from harder dependency-heavy cases. Baselines drop sharply on multi-step planning, structural constraints, auxiliary elements, and real-world geometric modeling, while PAGER remains consistently stronger.
Case Study
The rectangle example shows how early vertex errors distort diagonals, intersections, and downstream relations, while PAGER better preserves the intended geometric structure.
The rectangle construction requires accurate vertex placement before drawing diagonals and intersection relations. Small coordinate errors in early steps change the geometry that later operations depend on.
PAGER keeps the rectangular structure and central diagonal intersection more stable, while baseline models drift into distorted quadrilaterals, redundant elements, or missing construction steps.
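The dependence of the diagonal intersection on vertex placement can be made concrete with a small computation. The coordinates and the size of the perturbation below are hypothetical, and the helper `line_intersection` is our own; it uses the standard determinant formula for the intersection of two lines.

```python
import math

def line_intersection(p1, p2, p3, p4):
    """Intersection of line p1-p2 with line p3-p4, or None if parallel."""
    x1, y1 = p1
    x2, y2 = p2
    x3, y3 = p3
    x4, y4 = p4
    denom = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if denom == 0:
        return None
    a = x1 * y2 - y1 * x2
    b = x3 * y4 - y3 * x4
    px = (a * (x3 - x4) - (x1 - x2) * b) / denom
    py = (a * (y3 - y4) - (y1 - y2) * b) / denom
    return (px, py)

# Hypothetical rectangle in canvas pixels.
A, B, C, D = (100, 100), (300, 100), (300, 200), (100, 200)
C_drift = (312, 209)  # the same vertex placed 15 px away from its target

center_exact = line_intersection(A, C, B, D)
center_drift = line_intersection(A, C_drift, B, D)

# How far the intersection moved because of the single misplaced vertex.
shift = math.dist(center_exact, center_drift)
```

A single misplaced vertex moves the intersection off the rectangle's center, and every later object attached to that intersection inherits the error.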
Draft citation from the current paper source. Verify the final venue and identifier before publishing.
@misc{wei2026pager,
title={PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control},
author={Jingxuan Wei and Xi Bai and Shan Liu and Caijun Jia and Zheng Sun and Xinglong Xu and Siyuan Li and Linzhuang Sun and Bihui Yu and Conghui He and Cheng Tan},
year={2026},
url={}
}