Editing as Reasoning — Amaze-Bench Leaderboard

Editing-as-Reasoning (EAR) turns visual planning from step-by-step generation into a single-step image transformation.

HuggingFace Dataset | GitHub Code | Paper (Coming Soon)

Violation & Coverage

Violation (↓): Percentage of predicted path cells that fall outside the ground-truth (GT) path.

Coverage (↑): Percentage of predicted path cells that fall inside the ground-truth (GT) path.
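A minimal sketch of the two definitions above, read literally. It assumes paths are given as sets of (row, col) grid cells and that both metrics are normalized by the number of predicted path cells; the function name, the cell representation, and the choice of denominator are illustrative assumptions, and the benchmark's own evaluation code may normalize differently.

```python
from typing import Set, Tuple

Cell = Tuple[int, int]  # (row, col) index of a maze cell; assumed representation


def violation_and_coverage(pred: Set[Cell], gt: Set[Cell]) -> Tuple[float, float]:
    """Return (violation %, coverage %) for one sample.

    Assumption: both values are normalized by the number of predicted
    path cells, following the wording of the definitions literally.
    """
    if not pred:
        # No predicted path cells: report 0 for both rather than divide by zero.
        return 0.0, 0.0
    inside = len(pred & gt)        # predicted cells that lie on the GT path
    outside = len(pred) - inside   # predicted cells that lie off the GT path
    return 100.0 * outside / len(pred), 100.0 * inside / len(pred)


pred_cells = {(0, 0), (0, 1), (1, 1), (2, 2)}
gt_cells = {(0, 0), (0, 1), (1, 1), (1, 2)}
violation, coverage = violation_and_coverage(pred_cells, gt_cells)
print(violation, coverage)
```

In this toy case three of the four predicted cells lie on the GT path and one lies off it.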

MSE In & MSE Out

MSE In (↓): Mean squared error (MSE) within the ground-truth path region.

MSE Out (↓): Mean squared error (MSE) outside the ground-truth path region.
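A region-restricted MSE of this kind could be sketched as follows. The sketch assumes the error is measured between the generated image and a reference solution image of the same size, and that the GT path region is available as a boolean mask; `masked_mse` and its arguments are hypothetical names, and the value scale used by the leaderboard is not specified here.

```python
from typing import Sequence

Grid = Sequence[Sequence[float]]  # 2D array of values (assumed single channel)
Mask = Sequence[Sequence[bool]]   # True where a cell belongs to the GT path


def masked_mse(pred: Grid, target: Grid, mask: Mask, inside: bool = True) -> float:
    """Mean squared error over cells where mask == inside.

    inside=True gives an "MSE In"-style value, inside=False an "MSE Out"-style value.
    """
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if bool(m) == inside:
                total += (p - t) ** 2
                count += 1
    # An empty region contributes no error.
    return total / count if count else 0.0


pred = [[1.0, 0.0], [0.0, 0.5]]
target = [[1.0, 0.0], [1.0, 0.0]]
gt_mask = [[True, False], [True, False]]
mse_in = masked_mse(pred, target, gt_mask, inside=True)
mse_out = masked_mse(pred, target, gt_mask, inside=False)
print(mse_in, mse_out)
```

Reporting the two regions separately distinguishes a model that fails to draw the path (high MSE In) from one that alters the rest of the maze (high MSE Out).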

Pass@1 & Pass@5

Pass@1 (↑): Percentage of samples for which a single generation yields a valid path.

Pass@5 (↑): Percentage of samples for which at least one of five generations yields a valid path.
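Under the usual pass@k convention, a sample counts as passed if any of its k generations is valid. A minimal sketch of that computation, assuming each sample comes with a list of per-generation validity flags (the validity check itself is not specified here, and `pass_at_k` is a hypothetical helper name):

```python
from typing import Sequence


def pass_at_k(results: Sequence[Sequence[bool]], k: int) -> float:
    """Percentage of samples with at least one valid path among the first k generations.

    results[i][j] is True if generation j for sample i produced a valid path.
    """
    if not results:
        return 0.0
    passed = sum(1 for gens in results if any(gens[:k]))
    return 100.0 * passed / len(results)


# Four toy samples with five generations each.
results = [
    [True, False, False, False, False],   # valid on the first try
    [False, False, True, False, False],   # valid only on the third try
    [False, False, False, False, False],  # never valid
    [False, False, False, False, False],  # never valid
]
print(pass_at_k(results, 1), pass_at_k(results, 5))
```

With this convention Pass@5 can never be lower than Pass@1 for the same set of generations.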

Task: Editing as Reasoning
Dataset: Amaze
| Group | Model | Type | Violation (%) ↓ | Coverage (%) ↑ | MSE In ↓ | MSE Out ↓ | Pass@1 (%) ↑ | Pass@5 (%) ↑ |
|---|---|---|---|---|---|---|---|---|
| closed-source models | GPT-image-1 | closed-source | 62.88 | 58.97 | 41.16 | 52.76 | 5.40 | 6.06 |
| closed-source models | NanoBanana-Pro | closed-source | 47.76 | 64.21 | 24.20 | 17.21 | 4.82 | 9.28 |
| closed-source models | Seedream-4.5 | closed-source | 16.90 | 25.67 | 28.82 | 30.96 | 2.14 | 3.21 |
| open-source models | Flux-Kontext-Dev | open-source | 23.84 | 30.24 | 30.96 | 18.31 | 0.36 | 3.57 |
| open-source models | Qwen-Image-Edit | open-source | 19.37 | 28.51 | 18.82 | 5.70 | 1.43 | 2.14 |
| open-source models | Bagel | open-source (base) | 28.91 | 27.15 | 11.64 | 5.84 | 0.00 | 1.00 |
| open-source models | Janus-Pro | open-source (base) | 5.41 | 1.85 | 57.47 | 76.80 | 0.00 | 0.00 |
| fine-tuned models | Bagel (fine-tuned) | SFT on 3×3 | 12.21 | 51.02 | 8.66 | 3.07 | 11.54 | 23.64 |
| fine-tuned models | Janus-Pro (fine-tuned) | SFT on 3×3 | 35.60 | 23.33 | 55.99 | 50.94 | 1.43 | 2.22 |