PAGER

Precision-Aware Geometric Reasoning

PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

PAGER turns geometric reasoning into pixel-level GUI execution. PAGE Bench measures this precision-sensitive setting with verified GeoGebra trajectories, process supervision, and geometry-aware evaluation.

Jingxuan Wei [1,*], Xi Bai [1,*], Shan Liu [3,*], Caijun Jia [1,*], Zheng Sun [1], Xinglong Xu [1], Siyuan Li [2], Linzhuang Sun [1], Bihui Yu [1], Conghui He [2], Cheng Tan [2]
[1] University of Chinese Academy of Sciences
[2] Shanghai Artificial Intelligence Laboratory
[3] China University of Petroleum-Beijing
4,906 geometry problems
224K GUI actions
53,277 construction tasks
10 skill categories
62.20 step success
29.52 overall score

Motivation

When a correct action still lands in the wrong place

Conventional GUI tasks tolerate nearby clicks inside the same component. Geometric construction does not: a misplaced point changes every dependent line, circle, angle, intersection, label, and polygon.

Point-Level Accuracy

Success is measured by distance to a target point, not membership in a tolerant UI region.
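
The difference between the two criteria can be sketched in a few lines. This is a minimal illustration, not the benchmark's scoring code: the 5-pixel tolerance, the widget box, and the function names `point_hit` and `region_hit` are all assumptions chosen for the example.

```python
import math

def point_hit(pred, target, tol_px):
    """Point-level criterion: success only if the click is within tol_px of the target."""
    return math.dist(pred, target) <= tol_px

def region_hit(pred, box):
    """Conventional criterion: success if the click lands anywhere inside the widget box."""
    x, y = pred
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom

target = (400.0, 300.0)
button_box = (380.0, 280.0, 420.0, 320.0)  # hypothetical 40x40 px widget around the target
click = (412.0, 309.0)                      # lands 15 px away from the target

print(region_hit(click, button_box))         # True: fine for a button
print(point_hit(click, target, tol_px=5.0))  # False: not fine for a vertex
```

The same click passes the region test and fails the point test, which is the setting the benchmark targets.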

Dependency Coupling

Early coordinate drift cascades through downstream geometric objects and relations.

Continuous Parameters

Drawing requires exact endpoints, radii, text anchors, object styles, and canvas coordinates.

Semantic-Execution Gap

Models can choose the right operation type while failing to execute a valid construction.

PAGER Framework

Plan dependencies, execute pixels, optimize precision

PAGER factorizes drawing into dependency-structured planning and state-conditioned GUI execution, then trains with pixel-grounded SFT and precision-aligned reinforcement learning.

1. Dependency Planning

Induces a construction graph and orders primitives so prerequisite objects are available before dependent operations.
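
One simple way to order primitives so that prerequisites come first is a topological sort of the construction graph. The sketch below uses Python's standard `graphlib`; the object names and the graph itself are hypothetical, and the page does not state that PAGER uses this particular procedure.

```python
from graphlib import TopologicalSorter

# Each object maps to the set of objects it depends on (hypothetical rectangle example).
construction = {
    "A": set(), "B": set(), "C": set(), "D": set(),
    "diag_AC": {"A", "C"},
    "diag_BD": {"B", "D"},
    "O": {"diag_AC", "diag_BD"},  # intersection of the two diagonals
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(construction).static_order())
position = {name: i for i, name in enumerate(order)}

# Check the planning invariant: no object is drawn before its prerequisites.
for obj, deps in construction.items():
    assert all(position[d] < position[obj] for d in deps)

print(order)
```

The relative order of independent objects (for example the four vertices) is not fixed by the graph, so any order that satisfies the invariant is a valid plan.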

2. Pixel Execution

Grounds each sub-task into click, paint, and type actions conditioned on the current canvas state and action history.
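
A possible representation of the three action types is sketched below. The class names, field names, command strings, and pixel coordinates are illustrative assumptions, as is the reading of a paint action as a drag from a start point to an end point; the page only names the action types.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

Point = Tuple[int, int]  # pixel coordinates on the screenshot (assumed convention)

@dataclass(frozen=True)
class Click:
    at: Point

@dataclass(frozen=True)
class Paint:
    start: Point
    end: Point

@dataclass(frozen=True)
class TypeText:
    text: str

Action = Union[Click, Paint, TypeText]

def to_command(action: Action) -> str:
    """Serialize an action into a hypothetical command string."""
    if isinstance(action, Click):
        return f"click({action.at[0]}, {action.at[1]})"
    if isinstance(action, Paint):
        (x0, y0), (x1, y1) = action.start, action.end
        return f"paint({x0}, {y0}, {x1}, {y1})"
    if isinstance(action, TypeText):
        return f"type({action.text!r})"
    raise TypeError(f"unknown action: {action!r}")

# Hypothetical sub-task "draw and label segment AB": pick a tool, drag, type a label.
history: List[Action] = []
for action in [Click((48, 112)), Paint((320, 410), (560, 410)), TypeText("AB")]:
    history.append(action)  # the growing history is what the next step would condition on
    print(len(history), to_command(action))
```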

3. Precision Rewards

Combines action-type correctness, parameter accuracy, and rendered geometric validity to reduce rollout drift.
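
As a rough sketch of how three such terms might be combined into one scalar, consider the function below. The weights, the exponential distance decay, the 20-pixel scale, and the name `precision_reward` are assumptions made for illustration; the page does not give PAGER's actual reward formula.

```python
import math

def precision_reward(pred_type, gold_type, pred_xy, gold_xy, geometry_valid,
                     w_type=0.2, w_param=0.5, w_geo=0.3, scale_px=20.0):
    """Toy reward: action-type match + distance-decayed parameter term + validity flag."""
    type_ok = 1.0 if pred_type == gold_type else 0.0
    # Parameter accuracy only counts when the action type is right (assumed design choice).
    param = type_ok * math.exp(-math.dist(pred_xy, gold_xy) / scale_px)
    geo = 1.0 if geometry_valid else 0.0
    return w_type * type_ok + w_param * param + w_geo * geo

exact = precision_reward("click", "click", (100, 100), (100, 100), True)
drifted = precision_reward("click", "click", (100, 120), (100, 100), False)
wrong_type = precision_reward("type", "click", (100, 100), (100, 100), False)

print(exact, round(drifted, 3), wrong_type)
```

Under these assumed weights, a correctly typed but drifted action earns well below the full reward, so a policy cannot maximize the signal through action-type accuracy alone.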

PAGE Bench

A closed execution loop for geometric GUI trajectories

PAGE Bench converts K-12 geometry problems into executable GeoGebra task sequences, replays them in a live GUI, records pixel-level process traces, and filters invalid constructions after verification.

4,443 / 463 train / test split
45.76 actions per problem
88.03% click + paint actions

Key Statistics

Corpus scale and process scale from the PAGE Bench source table.

Statistic | Count | Share | Process | Count
Total problems | 4,906 | 100.00% | Total tasks | 53,277
Train / Test | 4,443 / 463 | 90.56% / 9.44% | Avg. tasks / problem | 10.86
Open-ended | 2,857 | 58.23% | Total actions | 224,497
Grade 8-10+ | 4,387 | 89.42% | Click + paint | 197,634
Intermediate / Hard | 2,940 / 1,677 | 94.11% (combined) | Type actions | 26,863
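
The derived figures quoted on this page follow from the raw counts in the table. A short consistency check, using only numbers from the table (the variable names are ours):

```python
problems, tasks, actions = 4906, 53277, 224497
click_paint, type_actions = 197634, 26863
train, test = 4443, 463

# The parts should add up to the totals.
assert train + test == problems
assert click_paint + type_actions == actions

print(round(tasks / problems, 2))             # 10.86 tasks per problem
print(round(actions / problems, 2))           # 45.76 actions per problem
print(round(100 * click_paint / actions, 2))  # 88.03 percent click + paint
print(round(100 * train / problems, 2))       # 90.56 percent train
print(round(100 * test / problems, 2))        # 9.44 percent test
```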

Results

PAGER improves trajectory-level stability

Strong VLMs often achieve high action-type accuracy, but task success remains low. PAGER narrows this gap by improving parameter control and preserving dependent geometric structure across the rollout.

Task Success vs. Overall Score

Representative models from the main-results table.

Semantic-Execution Gap

Claude-Sonnet-4.6 reaches 95.85 Action Accuracy and Gemini-3.1-Pro reaches 66.66 Step Success, yet their Task Success over complete trajectories remains 1.11 and 5.82, respectively. PAGER reaches 23.78 Task Success and 29.52 Overall, showing stronger stability across full construction trajectories.
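
One way to put a number on the gap is to subtract Task Success from Action Accuracy for each model. That difference is our own illustrative measure, not a metric defined by the paper; the scores are taken from the main-results table.

```python
# (Action Accuracy, Task Success) from the main-results table.
scores = {
    "Claude-Sonnet-4.6": (95.85, 1.11),
    "Gemini-3.1-Pro": (89.18, 5.82),
    "PAGER": (82.62, 23.78),
}

# Illustrative gap: how much action-type accuracy fails to turn into finished tasks.
gap = {model: round(action - task, 2) for model, (action, task) in scores.items()}
print(gap)

# Ratio behind the "4.1x task success over Gemini-3.1-Pro" headline.
ratio = scores["PAGER"][1] / scores["Gemini-3.1-Pro"][1]
print(round(ratio, 1))  # 4.1
```

PAGER has the lowest Action Accuracy of the three and still the smallest difference, which is consistent with the claim that its gains come from execution rather than operation choice.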

4.1x task success over Gemini-3.1-Pro
62.20 step success for point-precise control

Main Results on Precise Geometric Tasks

Full table from the paper. Metrics cover action semantics, parameter precision, step/task success, compositional process scores, visual/geometric final quality, and overall performance.

Model Action Param Step Task O-Comp. S-Comp. S-Vis. S-Geo. Middle Final Overall
Open-source VLMs
Qwen3-VL-8B 52.03 9.48 9.38 0.25 3.09 3.83 3.69 3.05 8.18 3.42 5.80
DeepSeek-VL2 26.33 1.75 1.53 0.00 18.25 8.46 3.70 12.40 3.11 11.23 7.17
GLM-4.5V 78.91 12.17 11.76 0.00 5.44 7.93 2.89 13.41 11.46 7.27 9.37
InternVL2.5-8B 40.34 1.48 1.33 0.00 9.23 7.19 1.72 10.85 4.45 7.44 5.94
KimiVL-A3B 46.60 0.59 0.57 0.00 1.70 7.12 1.24 14.05 4.83 5.70 5.27
MiniCPM-V-2.6-8B 45.87 1.06 0.71 0.00 16.01 4.81 1.17 8.18 4.84 8.12 6.48
LLaVA-NeXT-8B 45.77 1.69 1.59 0.00 8.04 5.13 1.74 8.75 5.06 6.05 5.56
Closed-source VLMs
Claude-Sonnet-4.6 95.85 36.44 36.38 1.11 7.04 9.46 3.11 15.39 21.17 8.65 14.91
GPT-5.4 88.04 31.71 31.41 0.56 10.99 9.59 3.53 15.39 18.59 9.96 14.28
Qwen3.6-Plus 90.95 51.60 51.07 4.90 18.19 11.04 4.23 10.52 27.41 11.72 19.56
Gemini-3.1-Pro 89.18 66.68 66.66 5.82 20.97 14.17 7.63 21.21 32.41 16.31 24.36
Specialized GUI Agents
UI-TARS 47.08 8.79 8.38 0.00 7.06 5.26 3.78 5.21 7.26 5.49 6.38
OS-ATLAS 51.24 9.19 8.80 0.29 14.56 6.65 4.32 6.36 7.98 8.50 8.24
InfiGUI-R1-3B 44.81 16.51 16.18 0.00 18.23 9.66 4.14 13.82 9.37 11.96 10.66
OpenCUA-7B 55.86 10.57 9.85 0.12 15.39 8.92 3.11 10.77 8.69 10.07 9.38
GUI-Actor-7B 47.26 5.31 5.02 0.00 18.66 6.04 1.42 7.58 6.26 9.21 7.74
PAGER 82.62 62.76 62.20 23.78 28.88 15.30 7.05 15.63 41.25 17.79 29.52
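
In every row of the table, the Overall column agrees with the mean of the Middle and Final columns to within rounding. This is an observation from the reported numbers, not a definition stated on this page; the check below reproduces it.

```python
# (model, Middle, Final, Overall) copied from the table above.
rows = [
    ("Qwen3-VL-8B", 8.18, 3.42, 5.80),
    ("DeepSeek-VL2", 3.11, 11.23, 7.17),
    ("GLM-4.5V", 11.46, 7.27, 9.37),
    ("InternVL2.5-8B", 4.45, 7.44, 5.94),
    ("KimiVL-A3B", 4.83, 5.70, 5.27),
    ("MiniCPM-V-2.6-8B", 4.84, 8.12, 6.48),
    ("LLaVA-NeXT-8B", 5.06, 6.05, 5.56),
    ("Claude-Sonnet-4.6", 21.17, 8.65, 14.91),
    ("GPT-5.4", 18.59, 9.96, 14.28),
    ("Qwen3.6-Plus", 27.41, 11.72, 19.56),
    ("Gemini-3.1-Pro", 32.41, 16.31, 24.36),
    ("UI-TARS", 7.26, 5.49, 6.38),
    ("OS-ATLAS", 7.98, 8.50, 8.24),
    ("InfiGUI-R1-3B", 9.37, 11.96, 10.66),
    ("OpenCUA-7B", 8.69, 10.07, 9.38),
    ("GUI-Actor-7B", 6.26, 9.21, 7.74),
    ("PAGER", 41.25, 17.79, 29.52),
]

TOL = 0.006  # half a unit in the second decimal, plus floating-point slack

mismatches = [name for name, middle, final, overall in rows
              if abs((middle + final) / 2 - overall) > TOL]
print(mismatches)  # []
```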

Detailed Analysis

Evaluation consistency and capability diagnostics

The main result table establishes the headline comparison. These supplementary views show whether the automated verifier agrees with expert judgment and where models fail across fine-grained geometric skills.

Automated Scores Align with Human Judgments

Human ratings track the automated evaluation closely, with a correlation of r=0.9397. Most existing MLLMs cluster in the lower-left region, while PAGER appears in the top-right corner, showing that higher verifier scores correspond to human-perceived geometric correctness.
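
For readers who want to reproduce this kind of agreement check on their own ratings, a Pearson correlation can be computed as below. The score lists are made-up toy values for illustration only; they are not the paper's data and will not reproduce r=0.9397.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Toy values (NOT from the paper): one automated score and one human rating per model.
automated = [5.0, 7.5, 9.0, 14.0, 24.0, 30.0]
human = [1.0, 1.5, 1.4, 2.2, 3.1, 4.0]

print(round(pearson_r(automated, human), 3))
```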

Fine-Grained Capability Breakdown

The ten-skill heatmap separates easier tool-mapping and object-construction abilities from harder dependency-heavy cases. Baselines drop sharply on multi-step planning, structural constraints, auxiliary elements, and real-world geometric modeling, while PAGER remains consistently stronger.

Case Study

Coordinate drift becomes visible in the final construction

The rectangle example shows how early vertex errors distort diagonals, intersections, and downstream relations, while PAGER better preserves the intended geometric structure.

Error propagation

The rectangle construction requires accurate vertex placement before drawing diagonals and intersection relations. Small coordinate errors in early steps change the geometry that later operations depend on.
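
The propagation can be made concrete with a small calculation. The sketch below uses a hypothetical 4x2 rectangle and a hypothetical drift of one vertex; the coordinates and helper name are ours, not values from the case study.

```python
import math

def line_intersection(p1, p2, p3, p4):
    """Intersection of line p1-p2 with line p3-p4 (assumes the lines are not parallel)."""
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = p1, p2, p3, p4
    denom = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    det12 = x1 * y2 - y1 * x2
    det34 = x3 * y4 - y3 * x4
    px = (det12 * (x3 - x4) - (x1 - x2) * det34) / denom
    py = (det12 * (y3 - y4) - (y1 - y2) * det34) / denom
    return (px, py)

A, B, C, D = (0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)
C_drift = (4.4, 2.3)  # hypothetical early placement error on one vertex

ideal = line_intersection(A, C, B, D)          # center of the true rectangle
drifted = line_intersection(A, C_drift, B, D)  # center after the vertex error

print(ideal)                       # (2.0, 1.0)
print(math.dist(ideal, drifted))   # the intersection point has moved

# In a rectangle the diagonals bisect each other; after the drift they no longer do.
print(math.dist(A, drifted), math.dist(drifted, C_drift))
```

A single misplaced vertex moves the diagonal, the intersection point, and every relation that later steps would build on that intersection.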

Constraint preservation

PAGER keeps the rectangular structure and central diagonal intersection more stable, while baseline models drift into distorted quadrilaterals, redundant elements, or missing construction steps.

Citation

Draft citation from the current paper source. Verify the final venue and identifier before publishing.

@misc{wei2026pager,
  title={PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control},
  author={Jingxuan Wei and Xi Bai and Shan Liu and Caijun Jia and Zheng Sun and Xinglong Xu and Siyuan Li and Linzhuang Sun and Bihui Yu and Conghui He and Cheng Tan},
  year={2026},
  url={}
}