Precision-Aware Geometric Reasoning

PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

PAGER turns geometric reasoning into pixel-level GUI execution. PAGE Bench measures this precision-sensitive setting with verified GeoGebra trajectories, process supervision, and geometry-aware evaluation.

Jingxuan Wei¹,*, Xi Bai¹,*, Shan Liu³,*, Caijun Jia¹,*, Zheng Sun¹, Xinglong Xu¹, Siyuan Li², Linzhuang Sun¹, Bihui Yu¹, Conghui He², Cheng Tan²

1 University of Chinese Academy of Sciences

2 Shanghai Artificial Intelligence Laboratory

3 China University of Petroleum-Beijing

4,906 geometry problems

224K GUI actions

53,277 construction tasks

10 skill categories

62.20 step success

29.52 overall score

Motivation

When a correct action still lands in the wrong place

Conventional GUI tasks tolerate nearby clicks inside the same component. Geometric construction does not: a misplaced point changes every dependent line, circle, angle, intersection, label, and polygon.

Point-Level Accuracy

Success is measured by distance to a target point, not membership in a tolerant UI region.
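
The contrast between region membership and point distance can be made concrete with a minimal sketch. The bounding box, target point, and 5-pixel tolerance below are hypothetical values chosen for illustration; the source does not state the tolerance PAGE Bench uses.

```python
import math

def hits_region(click, box):
    """Conventional GUI check: the click only has to fall inside a bounding box."""
    x, y = click
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom

def hits_point(click, target, tol_px):
    """Point-level check: the click must lie within tol_px pixels of the target."""
    return math.dist(click, target) <= tol_px

# Hypothetical values, not taken from the benchmark.
button_box = (100, 100, 180, 140)   # left, top, right, bottom
target_point = (140.0, 120.0)
click = (150.0, 128.0)

print(hits_region(click, button_box))        # region test on the same click
print(hits_point(click, target_point, 5.0))  # distance test on the same click
```

The same click passes the region test and fails the distance test, which is the situation the section describes: acceptable for a button, wrong for a vertex.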

Dependency Coupling

Early coordinate drift cascades through downstream geometric objects and relations.

Continuous Parameters

Drawing requires exact endpoints, radii, text anchors, object styles, and canvas coordinates.

Semantic-Execution Gap

Models can choose the right operation type while failing to execute a valid construction.

PAGER Framework

Plan dependencies, execute pixels, optimize precision

PAGER factorizes drawing into dependency-structured planning and state-conditioned GUI execution, then trains with pixel-grounded SFT and precision-aligned reinforcement learning.

Dependency Planning

Induces a construction graph and orders primitives so prerequisite objects are available before dependent operations.
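
One way to picture the ordering step is a topological sort over a construction graph. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the object names and the graph itself are hypothetical, and this is not presented as PAGER's planner.

```python
from graphlib import TopologicalSorter

# Hypothetical construction graph: each object maps to the objects it depends on.
construction = {
    "A": set(), "B": set(), "C": set(), "D": set(),  # free points
    "AC": {"A", "C"},                                 # diagonal through A and C
    "BD": {"B", "D"},                                 # diagonal through B and D
    "O": {"AC", "BD"},                                # intersection of the diagonals
}

def plan_order(graph):
    """Return objects in an order where every prerequisite comes first."""
    return list(TopologicalSorter(graph).static_order())

order = plan_order(construction)
print(order)
```

`static_order()` raises `graphlib.CycleError` when the dependencies are circular, which is a convenient signal that a plan cannot be executed as specified.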

Pixel Execution

Grounds each sub-task into click, paint, and type actions conditioned on the current canvas state and action history.
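
A possible data model for this execution loop is sketched below. Only the three action names (click, paint, type) come from the text; every field, including the assumption that a paint action carries a pointer path, is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

Point = Tuple[int, int]  # (x, y) in screen pixels

@dataclass(frozen=True)
class Click:
    at: Point

@dataclass(frozen=True)
class Paint:
    path: Tuple[Point, ...]  # assumed: pointer path of a press-drag-release gesture

@dataclass(frozen=True)
class TypeText:
    text: str

Action = Union[Click, Paint, TypeText]

@dataclass
class ExecutionState:
    canvas_id: str  # stand-in for the current canvas screenshot
    history: List[Action] = field(default_factory=list)

def step(state: ExecutionState, action: Action, next_canvas_id: str) -> ExecutionState:
    """Return the state the next action is conditioned on; the input is not mutated."""
    return ExecutionState(next_canvas_id, state.history + [action])

state = ExecutionState("frame-0")
state = step(state, Click((412, 236)), "frame-1")
state = step(state, Paint(((412, 236), (520, 310))), "frame-2")
state = step(state, TypeText("A"), "frame-3")
print(len(state.history), state.canvas_id)
```

Keeping the history alongside the latest canvas reference mirrors the conditioning described above: each new action sees both what is on screen and what has already been done.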

Precision Rewards

Combines action-type correctness, parameter accuracy, and rendered geometric validity to reduce rollout drift.
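
As a rough illustration of how three such signals could be combined into one scalar, consider the sketch below. The weights, the exponential distance mapping, and the 20-pixel scale are all assumptions made for this example; the source does not give PAGER's reward formula.

```python
import math

def parameter_accuracy(pred, target, scale_px=20.0):
    """Map pixel distance to (0, 1]; 1.0 means an exact hit. scale_px is hypothetical."""
    return math.exp(-math.dist(pred, target) / scale_px)

def precision_reward(type_ok, pred, target, geometry_valid, weights=(0.2, 0.5, 0.3)):
    """Weighted sum of action-type, parameter, and rendered-geometry terms.

    The weights are illustrative. Parameter credit is only given when the
    action type is correct, since coordinates for the wrong tool are meaningless.
    """
    w_type, w_param, w_geo = weights
    r_type = 1.0 if type_ok else 0.0
    r_param = parameter_accuracy(pred, target) if type_ok else 0.0
    r_geo = 1.0 if geometry_valid else 0.0
    return w_type * r_type + w_param * r_param + w_geo * r_geo

exact = precision_reward(True, (200, 150), (200, 150), True)
drifted = precision_reward(True, (212, 159), (200, 150), False)
print(exact, drifted)
```

Under this shaping, a correct action type with drifting coordinates earns only partial credit, so the reward separates semantic correctness from execution precision.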

PAGE Bench

A closed execution loop for geometric GUI trajectories

PAGE Bench converts K-12 geometry problems into executable GeoGebra task sequences, replays them in a live GUI, records pixel-level process traces, and filters invalid constructions after verification.

4,443 / 463 train / test split

45.76 actions per problem

88.03% click + paint actions

Key Statistics

Corpus scale and process scale from the PAGE Bench source table.

Corpus statistic	Count	Share	Process statistic	Count
Total problems	4,906	100.00%	Total tasks	53,277
Train / Test	4,443 / 463	90.56% / 9.44%	Avg. tasks / problem	10.86
Open-ended	2,857	58.23%	Total actions	224,497
Grade 8-10+	4,387	89.42%	Click + paint	197,634
Intermediate / Hard	2,940 / 1,677	94.11% (combined)	Type actions	26,863
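
The derived figures quoted for PAGE Bench (shares, averages, and the click + paint percentage) follow from the raw counts in the table. The short check below recomputes them; all inputs are copied from the table and the variable names are only for this example.

```python
# Raw counts copied from the Key Statistics table.
total_problems = 4906
train, test = 4443, 463
open_ended = 2857
grade_8_plus = 4387
intermediate, hard = 2940, 1677
total_tasks = 53277
click_paint, type_actions = 197634, 26863

total_actions = click_paint + type_actions

def pct(part, whole):
    """Percentage rounded to two decimals, as reported on the page."""
    return round(100 * part / whole, 2)

stats = {
    "total_actions": total_actions,
    "train_share": pct(train, total_problems),
    "test_share": pct(test, total_problems),
    "open_ended_share": pct(open_ended, total_problems),
    "grade_8_plus_share": pct(grade_8_plus, total_problems),
    "intermediate_hard_share": pct(intermediate + hard, total_problems),
    "tasks_per_problem": round(total_tasks / total_problems, 2),
    "actions_per_problem": round(total_actions / total_problems, 2),
    "click_paint_share": pct(click_paint, total_actions),
}

for name, value in stats.items():
    print(f"{name}: {value}")
```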

Results

PAGER improves trajectory-level stability

Strong VLMs often achieve high action-type accuracy, but task success remains low. PAGER narrows this gap by improving parameter control and preserving dependent geometric structure across the rollout.

Figure: Task Success vs. Overall Score for representative models from the main-results table.

Semantic-Execution Gap

Claude-Sonnet-4.6 reaches 95.85 Action Accuracy and Gemini-3.1-Pro reaches 66.66 Step Success, yet their Task Success over complete trajectories is only 1.11 and 5.82, respectively. PAGER reaches 23.78 Task Success and 29.52 Overall, showing stronger stability across full construction trajectories.

4.1x task success over Gemini-3.1-Pro

62.20 step success for point-precise control

Main Results on Precise Geometric Tasks

Full table from the paper. Metrics cover action semantics, parameter precision, step/task success, compositional process scores, visual/geometric final quality, and overall performance.

Model	Action	Param	Step	Task	O-Comp.	S-Comp.	S-Vis.	S-Geo.	Middle	Final	Overall
Open-source VLMs
Qwen3-VL-8B	52.03	9.48	9.38	0.25	3.09	3.83	3.69	3.05	8.18	3.42	5.80
DeepSeek-VL2	26.33	1.75	1.53	0.00	18.25	8.46	3.70	12.40	3.11	11.23	7.17
GLM-4.5V	78.91	12.17	11.76	0.00	5.44	7.93	2.89	13.41	11.46	7.27	9.37
InternVL2.5-8B	40.34	1.48	1.33	0.00	9.23	7.19	1.72	10.85	4.45	7.44	5.94
KimiVL-A3B	46.60	0.59	0.57	0.00	1.70	7.12	1.24	14.05	4.83	5.70	5.27
MiniCPM-V-2.6-8B	45.87	1.06	0.71	0.00	16.01	4.81	1.17	8.18	4.84	8.12	6.48
LLaVA-NeXT-8B	45.77	1.69	1.59	0.00	8.04	5.13	1.74	8.75	5.06	6.05	5.56
Closed-source VLMs
Claude-Sonnet-4.6	95.85	36.44	36.38	1.11	7.04	9.46	3.11	15.39	21.17	8.65	14.91
GPT-5.4	88.04	31.71	31.41	0.56	10.99	9.59	3.53	15.39	18.59	9.96	14.28
Qwen3.6-Plus	90.95	51.60	51.07	4.90	18.19	11.04	4.23	10.52	27.41	11.72	19.56
Gemini-3.1-Pro	89.18	66.68	66.66	5.82	20.97	14.17	7.63	21.21	32.41	16.31	24.36
Specialized GUI Agents
UI-TARS	47.08	8.79	8.38	0.00	7.06	5.26	3.78	5.21	7.26	5.49	6.38
OS-ATLAS	51.24	9.19	8.80	0.29	14.56	6.65	4.32	6.36	7.98	8.50	8.24
InfiGUI-R1-3B	44.81	16.51	16.18	0.00	18.23	9.66	4.14	13.82	9.37	11.96	10.66
OpenCUA-7B	55.86	10.57	9.85	0.12	15.39	8.92	3.11	10.77	8.69	10.07	9.38
GUI-Actor-7B	47.26	5.31	5.02	0.00	18.66	6.04	1.42	7.58	6.26	9.21	7.74
PAGER	82.62	62.76	62.20	23.78	28.88	15.30	7.05	15.63	41.25	17.79	29.52
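
One regularity is visible in the reported numbers: in every row, Overall matches the arithmetic mean of Middle and Final to within rounding. The page does not state how Overall is defined, so the check below is an observation about the table rather than the paper's formula. Values are copied from the table as (Middle, Final, Overall).

```python
# (Middle, Final, Overall) copied from the main-results table.
rows = {
    "Qwen3-VL-8B": (8.18, 3.42, 5.80),
    "DeepSeek-VL2": (3.11, 11.23, 7.17),
    "GLM-4.5V": (11.46, 7.27, 9.37),
    "InternVL2.5-8B": (4.45, 7.44, 5.94),
    "KimiVL-A3B": (4.83, 5.70, 5.27),
    "MiniCPM-V-2.6-8B": (4.84, 8.12, 6.48),
    "LLaVA-NeXT-8B": (5.06, 6.05, 5.56),
    "Claude-Sonnet-4.6": (21.17, 8.65, 14.91),
    "GPT-5.4": (18.59, 9.96, 14.28),
    "Qwen3.6-Plus": (27.41, 11.72, 19.56),
    "Gemini-3.1-Pro": (32.41, 16.31, 24.36),
    "UI-TARS": (7.26, 5.49, 6.38),
    "OS-ATLAS": (7.98, 8.50, 8.24),
    "InfiGUI-R1-3B": (9.37, 11.96, 10.66),
    "OpenCUA-7B": (8.69, 10.07, 9.38),
    "GUI-Actor-7B": (6.26, 9.21, 7.74),
    "PAGER": (41.25, 17.79, 29.52),
}

def max_deviation(table):
    """Largest gap between the reported Overall and mean(Middle, Final)."""
    return max(abs((mid + fin) / 2 - overall) for mid, fin, overall in table.values())

TOLERANCE = 0.011  # inputs are rounded to two decimals, so allow about one hundredth

print(max_deviation(rows) <= TOLERANCE)
```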

Detailed Analysis

Evaluation consistency and capability diagnostics

The main result table establishes the headline comparison. These supplementary views show whether the automated verifier agrees with expert judgment and where models fail across fine-grained geometric skills.

Automated Scores Align with Human Judgments

Human ratings track the automated evaluation closely, with a correlation of r=0.9397. Most existing MLLMs cluster in the lower-left region, while PAGER appears in the top-right corner, showing that higher verifier scores correspond to human-perceived geometric correctness.
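
For readers who want to reproduce this kind of agreement check on their own ratings, a Pearson correlation can be computed as sketched below. This assumes the reported r is a Pearson coefficient, which the page does not state explicitly, and the paired scores are made-up placeholders, not the study's data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Placeholder scores for illustration only; these are NOT the paper's ratings.
automated = [5.0, 7.5, 9.0, 14.0, 24.0, 30.0]
human = [6.0, 7.0, 10.0, 15.0, 22.0, 31.0]

print(round(pearson_r(automated, human), 4))
```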

Fine-Grained Capability Breakdown

The ten-skill heatmap separates easier tool-mapping and object-construction abilities from harder dependency-heavy cases. Baselines drop sharply on multi-step planning, structural constraints, auxiliary elements, and real-world geometric modeling, while PAGER remains consistently stronger.

Case Study

Coordinate drift becomes visible in the final construction

The rectangle example shows how early vertex errors distort diagonals, intersections, and downstream relations, while PAGER better preserves the intended geometric structure.

Error propagation

The rectangle construction requires accurate vertex placement before drawing diagonals and intersection relations. Small coordinate errors in early steps change the geometry that later operations depend on.
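
The propagation effect can be reproduced numerically. In the sketch below, the rectangle coordinates and the size of the drift are hypothetical; the intersection routine is the standard two-line determinant formula, not code from PAGER.

```python
import math

def line_intersection(p1, p2, p3, p4):
    """Intersection of line p1-p2 with line p3-p4, or None if they are parallel."""
    x1, y1 = p1
    x2, y2 = p2
    x3, y3 = p3
    x4, y4 = p4
    denom = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if abs(denom) < 1e-12:
        return None
    a = x1 * y2 - y1 * x2
    b = x3 * y4 - y3 * x4
    x = (a * (x3 - x4) - (x1 - x2) * b) / denom
    y = (a * (y3 - y4) - (y1 - y2) * b) / denom
    return (x, y)

def midpoint(p, q):
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)

# Hypothetical rectangle ABCD in canvas pixels.
A, B, C, D = (100.0, 100.0), (300.0, 100.0), (300.0, 200.0), (100.0, 200.0)
C_drifted = (312.0, 209.0)  # vertex C placed a few pixels off

center = line_intersection(A, C, B, D)
center_drifted = line_intersection(A, C_drifted, B, D)

print(center, center_drifted)
# In a true rectangle the diagonals bisect each other at the intersection.
print(math.dist(center, midpoint(B, D)), math.dist(center_drifted, midpoint(B, D)))
```

With the exact vertices the diagonals meet at the midpoint of BD; after the drift, the intersection moves away from that midpoint, so any later step that relies on the center inherits the error.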

Constraint preservation

PAGER keeps the rectangular structure and central diagonal intersection more stable, while baseline models drift into distorted quadrilaterals, redundant elements, or missing construction steps.

Citation

Draft citation from the current paper source. Verify the final venue and identifier before publishing.

@misc{wei2026pager,
  title={PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control},
  author={Jingxuan Wei and Xi Bai and Shan Liu and Caijun Jia and Zheng Sun and Xinglong Xu and Siyuan Li and Linzhuang Sun and Bihui Yu and Conghui He and Cheng Tan},
  year={2026},
  url={}
}