IMAGHarmony: Controllable Image Editing with Consistent Object Quantity and Layout

1 Nanjing University of Science and Technology 2 Tsinghua University 3 University of California

Abstract

Recent diffusion models have advanced image editing by enhancing visual quality and control, supporting broad applications across creative and personalized domains. However, current image editing methods largely overlook multi-object scenarios, where precise control over object categories, counts, and spatial layouts remains a significant challenge. To address this, we introduce a new task, quantity-and-layout consistent image editing (QL-Edit), which aims to enable fine-grained control of object quantity and spatial structure in complex scenes. We further propose IMAGHarmony, a structure-aware framework that incorporates harmony-aware attention (HA) to integrate multimodal semantics, explicitly modeling object counts and layouts to enhance editing accuracy and structural consistency. In addition, we observe that diffusion models are susceptible to initial noise and exhibit strong preferences for specific noise patterns. Motivated by this, we present a preference-guided noise selection (PNS) strategy that chooses semantically aligned initial noise samples based on vision-language matching, thereby improving generation stability and layout consistency in multi-object editing. To support evaluation, we construct HarmonyBench, a comprehensive benchmark covering diverse quantity and layout control scenarios. Extensive experiments demonstrate that IMAGHarmony consistently outperforms state-of-the-art methods in structural alignment and semantic accuracy. Code, models, and datasets will be publicly released.

IMAGHarmony Features

Motivation


Method


To address the challenge of controlling object counts and spatial layouts in multi-object image editing, we propose IMAGHarmony, a dedicated framework for the QL-Edit task. IMAGHarmony is built on the frozen denoising UNet of Stable Diffusion XL (SDXL) and incorporates the proposed harmony-aware attention (HA) module. At inference time, we further apply a preference-guided noise selection (PNS) strategy to enhance structural stability and semantic consistency.

During training, we first generate multiple initial noise seeds and use SDXL to denoise them conditioned on the given textual prompt, producing a set of candidate images. A frozen vision-language model (VLM) then evaluates these candidates for semantic alignment, and the top-k scoring images, together with their corresponding seeds, are retained for further processing. The auxiliary text and source image are encoded by frozen text and image encoders to extract conditional text and visual features, which are fed into the HA module to obtain harmony features that explicitly model object count and implicitly capture layout. A learnable IP attention layer inserted into the "Down4" block of the UNet injects the harmony features. To preserve the generative capability of the original model, we additionally apply cross-attention over the frozen UNet using a "<Null Prompt>".

During inference, the PNS strategy performs a few denoising steps on the candidate seeds, evaluates their semantic consistency, and selects the optimal noise seed for full denoising, yielding high-quality edited images that match the desired object count and spatial layout. Finally, we provide a comprehensive summary of the training and inference strategies implemented in IMAGHarmony.
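The inference-time PNS procedure described above can be sketched in a few lines. This is a minimal illustration, not the released implementation: `sample_fn` stands in for the SDXL denoising pipeline and `score_fn` for the frozen VLM alignment scorer, both of which are assumed interfaces.

```python
import random

def preference_guided_noise_selection(sample_fn, score_fn, prompt,
                                      seeds=None, num_seeds=8,
                                      probe_steps=5, full_steps=50):
    """Sketch of PNS: pick the initial seed whose cheap preview best matches the prompt.

    sample_fn(prompt, seed, steps) -> image : assumed diffusion sampler interface
    score_fn(image, prompt) -> float        : assumed VLM semantic-alignment score
    """
    if seeds is None:
        seeds = [random.randrange(2**31) for _ in range(num_seeds)]
    # Cheap previews: only a few denoising steps per candidate seed.
    scored = sorted(
        ((score_fn(sample_fn(prompt, s, probe_steps), prompt), s) for s in seeds),
        reverse=True,
    )
    best_seed = scored[0][1]
    # Full denoising is run only for the preferred seed.
    return sample_fn(prompt, best_seed, full_steps), best_seed
```

Because the previews use only `probe_steps` denoising steps, the selection cost stays a small fraction of one full generation per candidate seed.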

Example


HarmonyBench Dataset Demo

HarmonyBench Dataset

The dataset contains 200 image-caption pairs, systematically organized by object count from 1 to 20, with 10 diverse layout examples per count level to ensure broad coverage of numerical and spatial variations.
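The count-by-layout organization above (20 count levels × 10 layouts = 200 pairs) can be indexed as follows. This is a hedged sketch: the file-naming scheme is a hypothetical illustration, not the released directory layout.

```python
def build_harmonybench_index(max_count=20, layouts_per_count=10):
    """Enumerate the described dataset grid: object counts 1..20, 10 layouts each."""
    index = []
    for count in range(1, max_count + 1):
        for layout_id in range(layouts_per_count):
            index.append({
                "image": f"count_{count:02d}/layout_{layout_id}.png",  # hypothetical path
                "count": count,
                "layout_id": layout_id,
            })
    return index
```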

Citation Information
