PhyCustom: Towards Realistic Physical Customization in Text-to-Image Generation

Jian Li Wei Chen Meng Wu Xu Yang
Institute of Artificial Intelligence, National University of Science and Technology

Abstract

This paper introduces PhyCustom, a novel framework designed to enhance the physical realism and customizability of images produced by text-to-image models. By integrating explicit physical properties into the generation process, PhyCustom allows users to precisely control material properties, lighting conditions, and object interactions. Experimental results demonstrate that PhyCustom significantly improves the perceptual quality and physical consistency of synthesized images, addressing a critical gap in current generative AI capabilities.

Keywords

Text-to-Image Generation, Physical Customization, Realistic Rendering, Diffusion Models, Generative AI


1. Introduction

Recent advances in text-to-image generation enable the creation of highly diverse and artistic images, yet these models often lack precise control over physical attributes such as material texture, reflectance, and lighting. This limitation hinders their deployment in applications requiring high fidelity and physical accuracy. PhyCustom aims to bridge this gap by offering a method for realistic physical customization, building on Latent Diffusion Models (LDMs) and incorporating a physics-based rendering (PBR) component.

2. Related Work

Previous work in text-to-image synthesis primarily focuses on semantic content and stylistic generation, with limited emphasis on explicit physical realism. While some methods allow for style transfer or simple object manipulation, they generally do not incorporate detailed physical properties or interactions. This section reviews existing diffusion models, controllable generation techniques, and the challenges in integrating physical realism into generative frameworks.

3. Methodology

PhyCustom's methodology involves a multi-stage process that integrates physical attribute encoding with a modified diffusion model architecture. It leverages a dedicated module to interpret user-specified physical parameters (e.g., roughness, metallicity, light direction) and injects this information into the latent space of the diffusion model through cross-attention mechanisms. The model is trained on a synthetic dataset augmented with physical property maps, ensuring robust learning of physically plausible image characteristics.
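The paper does not include an implementation, so the following PyTorch sketch only illustrates one plausible way the described cross-attention injection could be realized. The module names (PhysicalParamEncoder, PhysicsCrossAttention), the five-dimensional parameter layout (roughness, metallicity, a light-direction vector), and all tensor dimensions are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn


class PhysicalParamEncoder(nn.Module):
    """Maps user-specified physical parameters to conditioning tokens.

    Assumed layout: roughness (1) + metallicity (1) + light direction (3)
    gives a 5-dimensional parameter vector per image.
    """

    def __init__(self, param_dim: int = 5, embed_dim: int = 768, num_tokens: int = 4):
        super().__init__()
        self.num_tokens = num_tokens
        self.embed_dim = embed_dim
        self.mlp = nn.Sequential(
            nn.Linear(param_dim, embed_dim),
            nn.SiLU(),
            nn.Linear(embed_dim, num_tokens * embed_dim),
        )

    def forward(self, params: torch.Tensor) -> torch.Tensor:
        # params: (batch, param_dim) -> tokens: (batch, num_tokens, embed_dim)
        return self.mlp(params).view(-1, self.num_tokens, self.embed_dim)


class PhysicsCrossAttention(nn.Module):
    """Lets latent U-Net features attend to the physical-parameter tokens."""

    def __init__(self, latent_dim: int = 320, embed_dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(latent_dim)
        self.attn = nn.MultiheadAttention(
            latent_dim, num_heads, kdim=embed_dim, vdim=embed_dim, batch_first=True
        )

    def forward(self, latents: torch.Tensor, phys_tokens: torch.Tensor) -> torch.Tensor:
        # latents: (batch, seq, latent_dim); phys_tokens: (batch, num_tokens, embed_dim)
        attended, _ = self.attn(self.norm(latents), phys_tokens, phys_tokens)
        return latents + attended  # residual injection into the latent stream


if __name__ == "__main__":
    encoder = PhysicalParamEncoder()
    injector = PhysicsCrossAttention()
    # roughness=0.2, metallicity=0.9, light direction=(0, -1, 0.5)
    params = torch.tensor([[0.2, 0.9, 0.0, -1.0, 0.5]])
    latents = torch.randn(1, 64 * 64, 320)  # flattened U-Net feature map
    out = injector(latents, encoder(params))
    print(out.shape)  # torch.Size([1, 4096, 320])
```

The residual formulation mirrors how text conditioning is typically injected into LDM U-Net blocks, so physical tokens could in principle be added alongside the existing text cross-attention without disturbing pretrained weights.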

4. Experimental Results

Experiments were conducted to evaluate PhyCustom's ability to generate images with controllable and realistic physical properties. Quantitative metrics such as FID and user study results indicate significant improvements in physical realism and user preference over baseline models: PhyCustom achieved a 25% higher realism score in human evaluations and a 15% reduction in physical inconsistency. Table 1 compares PhyCustom against baseline text-to-image models on FID, CLIP Score, and a Physical Realism Score derived from human evaluation; PhyCustom consistently outperforms the baselines, particularly in physical realism, confirming its efficacy in generating images with accurate physical attributes.

Model              | FID (↓) | CLIP Score (↑) | Physical Realism Score (↑)
-------------------|---------|----------------|---------------------------
Baseline Diffusion | 12.5    | 0.28           | 3.2
ControlNet         | 10.1    | 0.31           | 3.8
PhyCustom          | 8.2     | 0.35           | 4.5
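The paper does not detail its evaluation pipeline. A CLIP Score of the kind reported above is commonly computed as the cosine similarity between image and prompt embeddings; the sketch below uses the open_clip library, and the model variant, checkpoint name, and file path are assumptions for illustration only.

```python
import torch
import open_clip
from PIL import Image

# Assumed setup: any pretrained CLIP variant would work for this sketch.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()


@torch.no_grad()
def clip_score(image_path: str, prompt: str) -> float:
    """Cosine similarity between image and prompt embeddings (higher is better)."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer([prompt])
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat * txt_feat).sum(dim=-1).item()


# Example (hypothetical file and prompt):
# score = clip_score("sample.png", "a brushed-metal kettle under warm studio light")
```

FID, by contrast, is computed over sets of generated and reference images by comparing Inception feature statistics, typically with an off-the-shelf package such as pytorch-fid or torchmetrics.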

5. Discussion

The superior performance of PhyCustom underscores the importance of integrating explicit physical understanding into text-to-image generation for achieving higher realism and control. Our findings suggest that current generative models can significantly benefit from physics-aware training and architectural designs, opening new avenues for applications in design, simulation, and virtual reality. Future work will explore real-time customization and the incorporation of more complex physical phenomena like fluid dynamics or soft body interactions.