Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing

UC Berkeley, HKU, Adobe
arXiv 2025

TL;DR: We introduce X-Planner, an MLLM-based framework that decomposes complex image-editing instructions into a sequence of simpler, interpretable sub-instructions with region-aware editing guidance: segmentation masks and bounding boxes.

How Does X-Planner Work?


X-Planner is composed of an MLLM and a segmentation decoder, which together decompose complex instructions into sub-tasks and generate precise segmentation masks or bounding boxes. By dynamically selecting from a pool of specialized editing models, X-Planner performs localized edits in an iterative and interpretable way, enabling fine-grained, multi-object, and multi-step visual transformations with minimal manual effort.
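
To make the control flow concrete, here is a minimal Python sketch of this decompose-then-edit loop. The SubEdit fields, the plan_edits planner stub, and the EDITORS registry are hypothetical stand-ins for illustration, not the released X-Planner interface.

    # Minimal sketch of the decompose-then-edit loop described above.
    # SubEdit, plan_edits, and EDITORS are hypothetical stand-ins, not the
    # authors' released interface.
    from dataclasses import dataclass
    from typing import Callable, Dict, List, Optional, Tuple

    import numpy as np


    @dataclass
    class SubEdit:
        instruction: str                   # simple, atomic sub-instruction
        edit_type: str                     # e.g. "remove", "replace", "insert"
        mask: Optional[np.ndarray] = None  # segmentation mask for localized edits
        bbox: Optional[Tuple[int, int, int, int]] = None  # box for insertion tasks


    def plan_edits(image: np.ndarray, instruction: str) -> List[SubEdit]:
        """Placeholder for the MLLM + segmentation decoder: decompose the
        complex instruction into sub-edits with masks / bounding boxes."""
        raise NotImplementedError("plug in the planner model here")


    # Hypothetical registry of specialized editing models, keyed by edit type.
    EDITORS: Dict[str, Callable[[np.ndarray, SubEdit], np.ndarray]] = {}


    def execute(image: np.ndarray, instruction: str) -> np.ndarray:
        """Apply sub-edits sequentially so each step sees the previous result."""
        current = image
        for sub in plan_edits(current, instruction):
            editor = EDITORS[sub.edit_type]  # dynamic selection of a specialized editor
            current = editor(current, sub)   # localized, mask- or box-guided edit
        return current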

How Do We Prepare the Training Data to Train X-Planner?

Our framework is organized into a three-level pipeline for generating high-quality instruction-editing data.
- [Level 1] focuses on generating complex instruction–simple instruction pairs using a structured GPT-4o prompting template (a rough prompting sketch follows this list). These include indirect, multi-object, and multi-step edits, each annotated with object anchors and associated edit types.
- [Level 2] handles mask generation: given a source image and anchor text, we use Grounded SAM to extract fine-grained object masks, which are then refined based on the specific edit type using tailored strategies.
- [Level 3] addresses insertion tasks, where the object to be inserted does not exist in the original image. Here, we pre-train an MLLM on bounding box–annotated data to localize pseudo bounding boxes for such objects, ensuring the instruction is grounded even when the visual evidence is absent.
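
As a rough illustration of the Level-1 step, the sketch below asks GPT-4o for a JSON decomposition of a complex instruction. The prompt wording, JSON schema, and field names are assumptions for illustration, not the paper's exact template.

    # Illustrative Level-1 decomposition call. The prompt wording, JSON schema,
    # and field names are assumptions, not the paper's exact template.
    import json

    from openai import OpenAI

    LEVEL1_PROMPT = """You are an image-editing planner.
    Decompose the complex editing instruction below into simple sub-instructions.
    Return a JSON object of the form
    {{"sub_edits": [{{"sub_instruction": ..., "object_anchor": ..., "edit_type": ...}}]}}
    where edit_type is one of: replace, remove, insert, color, style.

    Complex instruction: {instruction}
    """


    def decompose(instruction: str) -> list:
        """Ask GPT-4o for a list of annotated simple sub-instructions."""
        client = OpenAI()
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user",
                       "content": LEVEL1_PROMPT.format(instruction=instruction)}],
            response_format={"type": "json_object"},
        )
        return json.loads(response.choices[0].message.content)["sub_edits"]

For example, an instruction such as "replace the dog with a cat and make the sky stormy" would ideally come back as two sub-edits, each with its own object anchor and edit type.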

X-Planner's Qualitative Comparison


Integrating X-Planner with InstructPix2Pix* and UltraEdit significantly improves object identity preservation and instruction alignment by leveraging the generated masks and bounding boxes (displayed in the bottom-left of each image). Unlike baselines that rely solely on complex prompts without masks, X-Planner's decomposition enables more precise and interpretable edits.
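
As a generic illustration of why masks help preserve identity, the sketch below composites an editor's output back into the source image outside the mask. This is a common technique shown for intuition only; it is not necessarily the exact mechanism used by X-Planner's editors.

    # Generic mask-guided compositing: keep the source pixels outside the edit
    # region so unrelated content and object identity are preserved. Shown for
    # intuition only; X-Planner's editors may apply masks differently.
    import numpy as np


    def composite_edit(source: np.ndarray, edited: np.ndarray,
                       mask: np.ndarray) -> np.ndarray:
        """Blend `edited` into `source` inside a binary (H, W) mask."""
        m = mask.astype(np.float32)[..., None]  # broadcast over RGB channels
        out = m * edited.astype(np.float32) + (1.0 - m) * source.astype(np.float32)
        return out.astype(source.dtype)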

X-Planner's Quantitative Comparison on COMPIE Benchmark


X-Planner boosts UltraEdit and InstructPix2Pix* by breaking down complex instructions and supplying control signals such as masks. Because CLIP_out has known limitations on complex edits, we additionally report an MLLM-based metric (InternVL2-76B) to better showcase X-Planner's effectiveness.
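
For intuition, a minimal sketch of an MLLM-as-judge query is shown below. The prompt wording and the 1-5 scale are illustrative assumptions; the actual metric computed with InternVL2-76B may differ.

    # Hedged sketch of an MLLM-as-judge query for instruction alignment.
    # The prompt wording and the 1-5 scale are illustrative assumptions.
    JUDGE_PROMPT = (
        "You are evaluating an instruction-based image edit.\n"
        "Instruction: {instruction}\n"
        "Given the source image and the edited image, rate from 1 to 5 how well "
        "the edit follows the instruction while leaving unrelated content "
        "unchanged. Answer with a single integer."
    )


    def build_judge_query(instruction: str, source_path: str,
                          edited_path: str) -> dict:
        """Package one evaluation request for an MLLM judge such as InternVL2-76B."""
        return {
            "prompt": JUDGE_PROMPT.format(instruction=instruction),
            "images": [source_path, edited_path],
        }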

Visualizing Consistent Bounding Boxes Across Repeated Runs

We demonstrate that X-Planner consistently generates plausible and semantically meaningful bounding boxes across repeated runs for insertion tasks. While maintaining alignment with the instruction, the model introduces natural variation in placement, such as different positions for objects like unicorns, laptops, or palm trees.
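
One simple way to quantify this consistency is to collect one predicted box per run and compare them by pairwise IoU. The sketch below is an illustrative check, not the paper's evaluation protocol.

    # Sketch of measuring bounding-box consistency across repeated planner runs:
    # collect one predicted box per run and report their mean pairwise IoU.
    # This is an illustrative check, not the paper's evaluation protocol.
    from itertools import combinations
    from typing import List, Tuple

    Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


    def iou(a: Box, b: Box) -> float:
        """Intersection-over-union of two axis-aligned boxes."""
        x0, y0 = max(a[0], b[0]), max(a[1], b[1])
        x1, y1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0


    def mean_pairwise_iou(boxes: List[Box]) -> float:
        """Higher values mean the runs placed the inserted object more consistently."""
        pairs = list(combinations(boxes, 2))
        return sum(iou(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0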

User Study on COMPIE Benchmark

We compare against InstructPix2Pix* and UltraEdit. “Better” means the images generated with our X-Planner are preferred, and vice versa.

BibTeX



    @misc{yeh2025simpleeditsxplannercomplex,
          title={Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing}, 
          author={Chun-Hsiao Yeh and Yilin Wang and Nanxuan Zhao and Richard Zhang and Yuheng Li and Yi Ma and Krishna Kumar Singh},
          year={2025},
          eprint={2507.05259},
          archivePrefix={arXiv},
          primaryClass={cs.CV},
          url={https://arxiv.org/abs/2507.05259}, 
    }