Image fusion seeks to seamlessly integrate foreground objects with background scenes, producing realistic and harmonious fused images. While existing methods typically insert objects directly into the background, achieving adaptive and interactive fusion remains a challenging yet appealing task.
To address this, we propose an iterative human-in-the-loop data generation pipeline that leverages a limited amount of initial data together with diverse textual prompts to generate fusion datasets across various scenarios and interactions, including placement, holding, wearing, and style transfer.
Building on this, we introduce DreamFuse, a novel approach based on the Diffusion Transformer (DiT) model that generates consistent and harmonious fused images incorporating both foreground and background information. DreamFuse employs a Positional Affine mechanism and uses Localized Direct Preference Optimization guided by human feedback to refine the results. Experimental results show that DreamFuse outperforms state-of-the-art methods across multiple metrics.
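The exact preference objective is not spelled out here, but a localized variant of a Diffusion-DPO-style loss might look roughly like the sketch below. All names, the masking scheme, and the error formulation are assumptions for illustration, not DreamFuse's actual implementation.

```python
import torch
import torch.nn.functional as F

def localized_dpo_loss(pred_win, pred_lose, ref_win, ref_lose,
                       target_win, target_lose, mask, beta=0.1):
    """Hypothetical sketch of a localized diffusion preference loss.

    pred_* / ref_*: denoising predictions of the trained and frozen reference
    models for the preferred ("win") and rejected ("lose") fused images.
    mask: spatial weights emphasizing the regions to be kept consistent
    or harmonized (an assumption about how "localized" is realized).
    """
    def masked_err(pred, target):
        # Per-sample denoising error, weighted by the spatial mask.
        err = ((pred - target) ** 2) * mask
        return err.flatten(1).mean(dim=1)

    # Model-vs-reference error gaps on preferred and rejected samples.
    delta_win = masked_err(pred_win, target_win) - masked_err(ref_win, target_win)
    delta_lose = masked_err(pred_lose, target_lose) - masked_err(ref_lose, target_lose)

    # DPO-style logistic loss: lower error on preferred samples relative to
    # rejected ones, anchored to the frozen reference model.
    return -F.logsigmoid(-beta * (delta_win - delta_lose)).mean()
```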
The framework of the data generation model and the position matching process. The left side of the figure illustrates the design of our data generation model, while the right side shows the position matching process and the data format. We enhance the diversity of the generated fusion data by combining flexible, rich prompts with various style LoRAs.
The framework of DreamFuse. We apply positional affine transformations to map the foreground's position and size onto the background. The foreground and background are concatenated with the noisy fused image as condition images before DiT's attention layers. Localized direct preference optimization is then used to improve background consistency and foreground harmony.
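As a rough illustration of the positional affine step, the sketch below warps a foreground image onto the background canvas at a target box by encoding the placement as a scale-plus-translation affine matrix. The helper name, the OpenCV-based warping, and the RGBA assumption are illustrative choices, not DreamFuse's code.

```python
import cv2
import numpy as np

def place_foreground(fg, bg_size, target_box):
    """Hypothetical sketch: map a foreground image onto the background
    canvas at a specified box via an affine transform.

    fg: HxWx4 foreground (RGBA) array.
    bg_size: (bg_h, bg_w) of the background image.
    target_box: (x, y, w, h) placement of the foreground on the background.
    """
    bg_h, bg_w = bg_size
    x, y, w, h = target_box
    fg_h, fg_w = fg.shape[:2]

    # Affine matrix encoding the scale and translation of the placement.
    affine = np.array([[w / fg_w, 0.0, x],
                       [0.0, h / fg_h, y]], dtype=np.float32)

    # Warp the foreground into background coordinates; pixels outside the
    # target box remain empty (zero alpha).
    return cv2.warpAffine(fg, affine, (bg_w, bg_h))
```

The warped foreground can then be concatenated with the background and the noisy fused image as conditioning inputs, as described in the caption above.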
@article{huang_dreamfuse,
  title   = {DreamFuse: Adaptive Image Fusion with Diffusion Transformer},
  author  = {Huang, Junjia and Yan, Pengxiang and Liu, Jiyang and Wu, Jie and Wang, Zhao and Wang, Yitong and Lin, Liang and Li, Guanbin},
  journal = {arXiv preprint arXiv:2504.08291},
  year    = {2025}
}