Research Progress on Layout-Controlled Text-to-Image Diffusion Models
Abstract:
With the rapid development of computer vision and generative models, layout-to-image generation has become an important research direction. The task is to generate realistic images that conform to a given spatial layout of objects, specified as bounding-box positions and class labels. In recent years, diffusion models, an emerging family of generative techniques, have become one of the mainstream approaches to layout-to-image generation owing to their distinctive strengths in image synthesis. Compared with generative adversarial networks (GANs), diffusion models achieve better image quality, training stability, and sample diversity. This paper surveys recent progress of diffusion models in layout-to-image generation, introduces the fundamental principles of diffusion models in detail, and organizes existing work into three categories: 1) dedicated diffusion-model methods; 2) adaptation methods built on pre-trained diffusion models; and 3) composition-based control methods applied at the inference stage. The paper also analyzes the advantages and disadvantages of the different layout-generation methods and discusses promising directions for future research.
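To make the task setup concrete, the minimal Python sketch below pairs a hypothetical layout specification (class labels plus normalized bounding boxes; the field names are illustrative assumptions, not any particular method's interface) with the standard forward noising step of a denoising diffusion probabilistic model (DDPM, Ho et al., 2020), the basic principle on which the surveyed methods build.

```python
import torch

# Hypothetical layout specification for layout-to-image generation:
# each object carries a class label and a normalized bounding box
# (x_min, y_min, x_max, y_max). Field names are illustrative only.
layout = [
    {"label": "dog",  "bbox": (0.10, 0.40, 0.45, 0.90)},
    {"label": "ball", "bbox": (0.55, 0.60, 0.75, 0.85)},
]

# DDPM forward (noising) process:
#   q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative products

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Draw x_t ~ q(x_t | x_0) by mixing the clean image with Gaussian noise."""
    noise = torch.randn_like(x0)
    return alpha_bar[t].sqrt() * x0 + (1.0 - alpha_bar[t]).sqrt() * noise

x0 = torch.rand(1, 3, 64, 64)    # toy "image" tensor in [0, 1]
x_half = q_sample(x0, t=500)     # heavily noised sample midway through the chain
```

A layout-conditioned model learns the reverse of this process, denoising step by step while being steered by the layout, for example by injecting the bounding boxes and labels into the denoiser's attention layers.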