Beyond Simple Edits:
X-Planner for Complex Instruction-Based Image Editing

¹UC Berkeley  ²HKU  ³Adobe Research
AAAI 2026
X-Planner teaser showing complex instruction decomposition

Figure 1. X-Planner decomposes complex editing instructions into interpretable sub-instructions with region-aware guidance. Recent methods such as IC-Edit (Zhang et al., NeurIPS 2025) and GPT-4o struggle to understand complex instructions and to preserve object identity.

TL;DR
  • Complex Instruction Decomposition: X-Planner is an MLLM-based framework that breaks down complex image editing instructions into a sequence of interpretable, simpler sub-instructions with region-aware editing guidance.
  • Region-Aware Control Signals: Each sub-instruction is paired with precise segmentation masks and bounding boxes, enabling fine-grained, multi-object, and multi-step visual transformations with minimal manual effort.
  • Flexible and Modular: X-Planner dynamically selects from a pool of specialized editing models, performing localized edits in an iterative and interpretable manner that significantly improves object identity preservation and instruction alignment.

Abstract

Instruction-based image editing has seen rapid progress, yet existing methods often struggle with complex, multi-step instructions that involve multiple objects and require precise spatial understanding. We introduce X-Planner, an MLLM-based framework that decomposes complex image editing instructions into a sequence of simpler, interpretable sub-instructions with region-aware editing guidance: segmentation masks and bounding boxes. X-Planner couples an MLLM with a segmentation decoder; together they decompose complex instructions into sub-tasks and generate precise segmentation masks or bounding boxes. By dynamically selecting from a pool of specialized editing models, X-Planner performs localized edits in an iterative and interpretable way, enabling fine-grained, multi-object, and multi-step visual transformations with minimal manual effort. We demonstrate that integrating X-Planner with existing editing models (InstructPix2Pix* and UltraEdit) significantly improves both object identity preservation and instruction alignment on the COMPIE benchmark. We further demonstrate the flexibility of X-Planner by training it on datasets generated by both a closed-source model (GPT-4o) and an open-source model (Pixtral-Large).

Key Findings


Decompose, Then Edit

Complex instructions confuse existing editors. X-Planner decomposes them into interpretable sub-instructions with precise masks and bounding boxes, enabling multi-object, multi-step edits that preserve object identity.

Significant Quality Gains

Integrating X-Planner with UltraEdit and InstructPix2Pix* substantially improves instruction alignment and identity preservation on the COMPIE benchmark, validated by both MLLM-based metrics and user studies.

Flexible Training Data

X-Planner can be trained on datasets generated by both closed-source (GPT-4o) and open-source (Pixtral-Large) models, demonstrating broad flexibility without dependence on proprietary APIs.

How Does the X-Planner Work?

X-Planner couples an MLLM with a segmentation decoder; together they decompose complex instructions into sub-tasks and generate precise segmentation masks or bounding boxes. By dynamically selecting from a pool of specialized editing models, X-Planner performs localized edits in an iterative and interpretable way, enabling fine-grained, multi-object, and multi-step visual transformations with minimal manual effort.
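As a rough illustration of this control flow, here is a minimal Python sketch of an X-Planner-style edit loop. The SubEdit fields, the plan function, and the EDITORS registry are assumed interfaces for illustration only; they are not the paper's released API.

```python
# Hypothetical sketch of X-Planner-style inference (all names are assumptions).
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class SubEdit:
    instruction: str                 # one simple, atomic sub-instruction
    edit_type: str                   # e.g. "remove", "replace", "insert"
    mask: Optional[object] = None    # segmentation mask for localized edits
    bbox: Optional[tuple] = None     # bounding box, e.g. for insertions

def plan(image, complex_instruction) -> list[SubEdit]:
    """Stand-in for the fine-tuned MLLM + segmentation decoder: decomposes
    the complex instruction and grounds each step with a mask or box."""
    raise NotImplementedError

# Registry of specialized editors keyed by edit type (assumed interface).
EDITORS: dict[str, Callable] = {}

def execute(image, complex_instruction):
    # Iteratively apply each sub-edit with its region-aware guidance.
    for step in plan(image, complex_instruction):
        editor = EDITORS[step.edit_type]  # dynamic model selection
        image = editor(image, step.instruction, mask=step.mask, bbox=step.bbox)
    return image
```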

X-Planner method overview

Figure 2. Overview of the X-Planner framework. An MLLM decomposes complex instructions into sub-tasks, generates segmentation masks or bounding boxes, and dynamically selects specialized editing models for each sub-task.

How Do We Prepare the Training Data to Learn the X-Planner?

Our framework is organized into a three-level pipeline for generating high-quality instruction-editing data.

  • [Level 1] focuses on generating complex/simple instruction pairs using a structured GPT-4o prompting template. These include indirect, multi-object, and multi-step edits, each annotated with object anchors and associated edit types.
  • [Level 2] handles mask generation: given a source image and anchor text, we use Grounded SAM to extract fine-grained object masks, which are then refined based on the specific edit type using tailored strategies (see the sketch after this list).
  • [Level 3] addresses insertion tasks, where the object to be inserted does not exist in the original image. Here, we pre-train an MLLM on bounding-box-annotated data to localize pseudo bounding boxes for such objects, ensuring the instruction is grounded even when visual evidence is absent (a box-serialization sketch follows Figure 3).
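To make Level 2 concrete, below is a hedged sketch of the mask step. Here grounded_sam stands in for an actual Grounded-SAM pipeline (text-grounded detection followed by SAM segmentation), and the per-edit-type refinement strategies shown are illustrative assumptions, not the paper's exact recipes.

```python
# Level-2 sketch: text-grounded mask extraction + edit-type-aware refinement.
# `grounded_sam` is a hypothetical wrapper; plug in a real Grounded-SAM setup.
import numpy as np
import cv2

def grounded_sam(image: np.ndarray, anchor_text: str) -> np.ndarray:
    """Hypothetical: returns a binary HxW mask for the object in anchor_text."""
    raise NotImplementedError

def refine_mask(mask: np.ndarray, edit_type: str) -> np.ndarray:
    # Illustrative refinement per edit type (assumptions, not the paper's rules).
    kernel = np.ones((15, 15), np.uint8)
    if edit_type == "remove":
        # Dilate so inpainting also covers shadows and object boundaries.
        return cv2.dilate(mask, kernel, iterations=1)
    if edit_type == "replace":
        # Close small holes so the replacement region is contiguous.
        return cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask  # e.g. attribute edits keep the tight object mask
```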
Level 1: Instruction pair generation
Level 2: Mask generation
Level 3: Insertion bounding box

Figure 3. Three-level training data generation pipeline. Level 1: complex/simple instruction pair generation. Level 2: mask extraction and refinement. Level 3: bounding box localization for insertion tasks.
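For Level 3, one common way to teach an MLLM to emit boxes is to serialize coordinates as text tokens in its training targets. The format below (coordinates quantized to a 0-1000 grid inside <box> tags) is a convention borrowed from other grounded MLLMs and is only an assumed stand-in for the paper's actual format.

```python
# Serialize an insertion bounding box as a text target for MLLM fine-tuning.
# The 0-1000 quantized <box> format is an assumption, not the paper's spec.
def box_to_text(bbox: tuple[float, float, float, float], w: int, h: int) -> str:
    x0, y0, x1, y1 = bbox
    q = lambda v, s: round(1000 * v / s)  # normalize to a 1000x1000 grid
    return f"<box>{q(x0, w)},{q(y0, h)},{q(x1, w)},{q(y1, h)}</box>"

# e.g. box_to_text((128, 100, 384, 320), w=512, h=512) -> "<box>250,195,750,625</box>"
```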

Qualitative Comparison

Integrating X-Planner with InstructPix2Pix* and UltraEdit significantly improves object identity preservation and instruction alignment by leveraging the generated masks and bounding boxes (displayed in the bottom-left of each image). Unlike baselines, which rely solely on complex prompts without masks, X-Planner's decomposition enables more precise and interpretable edits.

Figure 4. Qualitative comparison of X-Planner integrated with editing models versus baselines.

Quantitative Comparison on COMPIE Benchmark

X-Planner boosts UltraEdit and InstructPix2Pix* by breaking down complex instructions and supplying control signals such as masks. Because CLIPout has known limitations on complex edits, we additionally report an MLLM-based metric (InternVL2-76B) that better reflects X-Planner's effectiveness; a sketch of the standard metrics follows Table 1. We again demonstrate X-Planner's flexibility by training it on datasets generated by both a closed-source model (GPT-4o) and an open-source model (Pixtral-Large).

| Methods | Guidance Control | L1 ↓ | CLIPim ↑ | CLIPout ↑ | DINO ↑ | MLLMti ↑ | MLLMim ↑ |
|---|---|---|---|---|---|---|---|
| SmartEdit | No | 0.2764 | 0.7713 | 0.2512 | 0.6044 | 0.6511 | 0.5347 |
| MGIE | No | 0.2988 | 0.7692 | 0.2498 | 0.5981 | 0.6408 | 0.5288 |
| UltraEdit (Baseline) | No | 0.1292 | 0.7688 | 0.2698 | 0.6387 | 0.6652 | 0.5523 |
| GenArtist + UltraEdit | Mask + Decomp. Instr. | 0.1253 | 0.7767 | 0.2621 | 0.6435 | 0.6894 | 0.5593 |
| X-Planner + UltraEdit | Decomp. Instr. Only | 0.1253 | 0.7767 | 0.2621 | 0.6435 | 0.6894 | 0.5593 |
| X-Planner + UltraEdit | Mask + Decomp. Instr. | 0.1188 | 0.7875 | 0.2569 | 0.6599 | 0.7061 | 0.5744 |
| InstructPix2Pix* (Baseline) | No | 0.1517 | 0.8020 | 0.2666 | 0.6988 | 0.6727 | 0.6160 |
| GenArtist + IP2P* | Mask + Decomp. Instr. | 0.1458 | 0.8143 | 0.2641 | 0.7114 | 0.7072 | 0.6277 |
| X-Planner + IP2P* | Decomp. Instr. Only | 0.1458 | 0.8143 | 0.2641 | 0.7114 | 0.7072 | 0.6277 |
| X-Planner + IP2P* | Mask + Decomp. Instr. | 0.1320 | 0.8285 | 0.2591 | 0.7068 | 0.7408 | 0.6454 |

Table 1. Quantitative comparison on the COMPIE benchmark. X-Planner with mask + decomposed instructions significantly improves editing performance across both UltraEdit and InstructPix2Pix* backbones.
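For reference, here is a minimal sketch of how two of the standard automatic metrics in Table 1 (L1 pixel distance and CLIPout text-image similarity) can be computed with Hugging Face transformers. This illustrates the metric definitions only; it is not the paper's evaluation code, and the MLLM-based scores would instead come from prompting InternVL2-76B.

```python
# Sketch of standard editing metrics (illustrative; not the paper's eval code).
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def l1_distance(src: Image.Image, out: Image.Image) -> float:
    # Mean absolute pixel difference in [0, 1]; lower is better.
    a = np.asarray(src.resize((256, 256)), dtype=np.float32) / 255.0
    b = np.asarray(out.resize((256, 256)), dtype=np.float32) / 255.0
    return float(np.abs(a - b).mean())

def clip_out(out: Image.Image, target_caption: str) -> float:
    # Cosine similarity between the edited image and the target caption.
    inputs = proc(text=[target_caption], images=out,
                  return_tensors="pt", padding=True)
    with torch.no_grad():
        o = clip(**inputs)
    img = o.image_embeds / o.image_embeds.norm(dim=-1, keepdim=True)
    txt = o.text_embeds / o.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum(dim=-1))
```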

Consistent Bounding Box Across Repeated Runs

We demonstrate that X-Planner consistently generates plausible and semantically meaningful bounding boxes across repeated runs for insertion tasks. While maintaining alignment with the instruction, the model introduces natural location variations, such as different positions for objects like unicorns, laptops, or palm trees.

Bounding box consistency across repeated runs

Figure 5. Bounding box predictions across repeated runs show consistent, semantically meaningful placement with natural variation.
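One simple way to quantify the consistency shown in Figure 5 is to sample the planner several times for the same insertion instruction and measure pairwise IoU between the predicted boxes. The helper below is a generic sketch of that check, not part of the paper's protocol.

```python
# Pairwise IoU as a simple consistency score over repeated box predictions.
def iou(a: tuple, b: tuple) -> float:
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    ix0, iy0 = max(ax0, bx0), max(ay0, by0)
    ix1, iy1 = min(ax1, bx1), min(ay1, by1)
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

def mean_pairwise_iou(boxes: list[tuple]) -> float:
    pairs = [(i, j) for i in range(len(boxes)) for j in range(i + 1, len(boxes))]
    return sum(iou(boxes[i], boxes[j]) for i, j in pairs) / len(pairs)
```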

User Study on COMPIE Benchmark

We compare against InstructPix2Pix* and UltraEdit. "Better" means the result produced with our X-Planner is preferred over the baseline, and vice versa.

User study results on COMPIE benchmark

Figure 6. User study comparing X-Planner-augmented editing against baselines. X-Planner is strongly preferred by human evaluators.

BibTeX

@inproceedings{yeh2026beyond,
  title={Beyond Simple Edits: {X-Planner} for Complex Instruction-Based Image Editing},
  author={Yeh, Chun-Hsiao and Wang, Yilin and Zhao, Nanxuan and Zhang, Richard and Li, Yuheng and Ma, Yi and Singh, Krishna Kumar},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={40},
  number={14},
  pages={11991--11999},
  year={2026}
}