
DanceGRPO: A Unified Framework for Reinforcement Learning in Visual Generation Across Multiple Paradigms and Tasks


Recent advances in generative models, particularly diffusion models and rectified flows, have significantly improved visual content creation. Incorporating human feedback during training is crucial for aligning outputs with human preferences and aesthetic standards. However, current methods have limitations: ReFL-style approaches are VRAM-inefficient for video generation, while DPO variants yield only marginal visual improvements.

Reinforcement Learning from Human Feedback (RLHF) is used to align large language models (LLMs) by training reward functions on comparison data. Policy gradient methods are effective but computationally intensive and require extensive tuning, whereas Direct Preference Optimization (DPO) is cheaper but often yields inferior performance. Recent work such as DeepSeek-R1 indicates that large-scale RL with specialized reward functions can elicit emergent reasoning in LLMs. For visual generation, existing approaches include DPO-style methods, direct backpropagation of reward signals (as in ReFL), and policy gradient methods such as DPOK and DDPO. Production models primarily rely on DPO and ReFL because policy gradient methods have proven unstable in large-scale applications.

Researchers from ByteDance Seed and the University of Hong Kong have introduced DanceGRPO, a unified framework that adapts Group Relative Policy Optimization (GRPO) to visual generation across multiple paradigms. The framework operates seamlessly with both diffusion models and rectified flows, supporting text-to-image, text-to-video, and image-to-video generation. It integrates with four foundation models (Stable Diffusion, HunyuanVideo, FLUX, SkyReels-I2V) and five reward models covering image/video aesthetics, text-image alignment, video motion quality, and binary reward assessments.
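As a rough illustration of the core idea behind GRPO-style optimization, the sketch below shows a group-relative advantage computation: several samples are generated for the same prompt, scored by a reward model, and each sample's advantage is its reward normalized against the group's mean and standard deviation. The function name and the example reward values are hypothetical and not taken from the DanceGRPO implementation.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Normalize rewards within a group of samples drawn from the same prompt.

    GRPO-style objectives replace a learned critic with group statistics:
    each sample's advantage is how far its reward sits above or below the
    mean reward of its sibling samples, in units of the group's std.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical example: eight images sampled for one prompt and scored by a
# reward model (e.g., an aesthetics scorer). Values are made up.
rewards = torch.tensor([0.31, 0.28, 0.35, 0.22, 0.30, 0.27, 0.33, 0.25])
advantages = group_relative_advantages(rewards)
print(advantages)  # positive for above-average samples, negative for below-average
```

Samples that beat their group's average are reinforced during the policy update, while below-average samples are penalized, without requiring a separate value network.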

DanceGRPO has demonstrated improvements over baseline models of up to 181% on key benchmarks, including HPS-v2.1, CLIP Score, VideoAlign, and GenEval. The framework draws on five specialized reward models to improve visual generation quality (a sketch of how such signals can be combined appears after the list):

  • Image Aesthetics: Quantifies visual appeal using models fine-tuned on human-rated data.
  • Text-image Alignment: Utilizes CLIP to maximize cross-modal consistency.
  • Video Aesthetics Quality: Extends aesthetic evaluation to the temporal domain using Vision Language Models (VLMs).
  • Video Motion Quality: Assesses motion realism through physics-aware VLM analysis.
  • Thresholding Binary Reward: Implements a discretization mechanism to evaluate generative models’ ability to learn under threshold-based optimization.
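To make the role of these reward models concrete, here is a minimal, hypothetical sketch of how several reward signals, including a thresholded binary reward, could be combined into a single scalar for optimization. The function names, weights, and dummy reward values are illustrative assumptions, not DanceGRPO's actual reward-aggregation code.

```python
from typing import Any, Callable, Dict

# A reward function scores a generated sample (image or video) against its prompt.
RewardFn = Callable[[Any, str], float]

def thresholded(reward_fn: RewardFn, threshold: float) -> RewardFn:
    """Discretize a continuous reward into {0, 1} around a fixed threshold."""
    return lambda sample, prompt: 1.0 if reward_fn(sample, prompt) >= threshold else 0.0

def combined_reward(sample: Any, prompt: str,
                    reward_fns: Dict[str, RewardFn],
                    weights: Dict[str, float]) -> float:
    """Weighted sum of individual reward signals (weights are illustrative)."""
    return sum(weights[name] * fn(sample, prompt) for name, fn in reward_fns.items())

# Dummy stand-ins for real reward models (HPS-style aesthetics, CLIP alignment, ...).
reward_fns: Dict[str, RewardFn] = {
    "aesthetics":        lambda s, p: 0.36,
    "clip_alignment":    lambda s, p: 0.39,
    "binary_aesthetics": thresholded(lambda s, p: 0.36, threshold=0.30),
}
weights = {"aesthetics": 1.0, "clip_alignment": 1.0, "binary_aesthetics": 1.0}

score = combined_reward(None, "a cat on a skateboard", reward_fns, weights)
print(score)  # 0.36 + 0.39 + 1.0 = 1.75
```

The combined scalar is what the policy update sees; in practice, each reward term would be produced by a trained model rather than a constant.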

DanceGRPO delivers notable gains in reward metrics for Stable Diffusion v1.4, raising the HPS score from 0.239 to 0.365 and the CLIP Score from 0.363 to 0.395. Evaluations on Pick-a-Pic and GenEval confirm its effectiveness, with DanceGRPO outperforming all competing methods. For HunyuanVideo-T2I, optimization with the HPS-v2.1 reward model raised the mean reward score from 0.23 to 0.33, improving alignment with human aesthetic preferences. For video generation, although text-video alignment was excluded for stability reasons, the method achieved relative improvements of 56% and 181% in visual and motion quality metrics, respectively. Motion quality measured by the VideoAlign reward model also showed a substantial 91% relative improvement.

In summary, DanceGRPO offers a unified framework designed to enhance diffusion models and rectified flows across various tasks. It effectively addresses critical limitations of prior methods by bridging the gap between language and visual modalities, achieving superior performance through efficient alignment with human preferences and robust scaling to complex, multi-task environments. Future work will focus on extending GRPO to multimodal generation, further unifying optimization paradigms in Generative AI.

For further details, see the paper and project page. All credit for this research goes to the researchers of this project.
