1. Introduction
The rapid advancement in text-to-image (T2I) generation has opened new frontiers in creative AI, yet challenges persist in achieving precise semantic alignment and high-fidelity output. Current T2I models often struggle with complex prompts or require extensive fine-tuning for specific styles. This paper proposes Mask-GRPO to leverage reinforcement learning for optimizing the generative process, aiming to enhance the quality and controllability of generated images. The primary models utilized include masked generative models (e.g., based on diffusion or auto-regressive architectures) and a novel Gradient Regularized Policy Optimization (GRPO) algorithm tailored for this context.
2. Related Work
Previous work in text-to-image synthesis primarily includes diffusion models like DALL-E 2, Stable Diffusion, and autoregressive models like DALL-E. Reinforcement learning has been explored in generative tasks, often for sequence generation or optimizing reward functions, but its application to fine-grained image generation within masked modeling contexts is less common. Existing RL-based generative methods typically focus on improving diversity or adversarial robustness, and our approach distinguishes itself by directly optimizing policy within a masked generative framework.
3. Methodology
Mask-GRPO operates by iteratively refining the image generation process through a reinforcement learning loop. The masked generative model acts as the environment, predicting masked tokens or pixels, while a policy network determines actions to fill in these masks based on textual prompts and current image state. Rewards are formulated using metrics such as CLIP score for semantic alignment and image quality assessment (e.g., FID, aesthetic scores) to guide the policy network's optimization. The Gradient Regularized Policy Optimization (GRPO) ensures stable and efficient learning by incorporating gradient-based regularization during policy updates, mitigating common issues like policy collapse in high-dimensional action spaces.
4. Experimental Results
Experimental evaluations demonstrate that Mask-GRPO significantly outperforms baseline text-to-image models across various metrics. Quantitative results indicate superior image quality, measured by lower FID scores, and improved prompt-image alignment, indicated by higher CLIP scores. Human evaluation also corroborated these findings, favoring images generated by Mask-GRPO for their coherence and aesthetic appeal. The table below summarizes key performance comparisons, highlighting the effectiveness of our proposed method in generating more accurate and visually appealing images.
Table I: Performance Comparison of Text-to-Image Models
| Model | FID Score (↓) | CLIP Score (↑) | Human Preference (↑) |
|---|---|---|---|
| Baseline Diffusion | 12.5 | 0.28 | 65% |
| Improved Diffusion | 10.2 | 0.31 | 72% |
| Mask-GRPO (Ours) | 8.1 | 0.35 | 88% |
5. Discussion
The superior performance of Mask-GRPO confirms the efficacy of integrating reinforcement learning with masked generative models for text-to-image synthesis. This approach enables a finer control over the generation process, leading to outputs that are not only high-fidelity but also more semantically aligned with complex textual prompts. The stability provided by GRPO is crucial for successful training in this challenging domain. Future work could explore extending Mask-GRPO to conditional image editing tasks or incorporating more sophisticated reward functions to capture nuanced aesthetic preferences.