VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation
Abstract
Visual autoregressive models suffer training instability caused by asynchronous policy conflicts; a novel framework addresses these by enhancing group relative policy optimization with intermediate rewards, dynamic time-step reweighting, and a mask propagation algorithm.
Visual generation is dominated by three paradigms: AutoRegressive (AR), diffusion, and Visual AutoRegressive (VAR) models. Unlike AR and diffusion models, VAR models operate on heterogeneous input structures across their generation steps, which creates severe asynchronous policy conflicts. This issue becomes particularly acute in reinforcement learning (RL) settings, leading to unstable training and suboptimal alignment. To resolve this, we propose a novel framework that enhances Group Relative Policy Optimization (GRPO) by explicitly managing these conflicts. Our method integrates three synergistic components: 1) a stabilizing intermediate reward that guides early-stage generation; 2) a dynamic time-step reweighting scheme for precise credit assignment; and 3) a novel mask propagation algorithm, derived from principles of Reward Feedback Learning (ReFL), that isolates optimization effects both spatially and temporally. Our approach yields significant improvements in sample quality and objective alignment over the vanilla GRPO baseline, enabling robust and effective optimization of VAR models.
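The abstract does not include an implementation, but the first two components can be illustrated with a short sketch. Below is a minimal, hypothetical PyTorch rendering of a GRPO-style clipped objective augmented with an intermediate reward term and per-time-step weights. All names (`grpo_var_loss`, `r_inter`, `step_w`, `lam`) and the additive reward-mixing scheme are assumptions made for illustration, not the paper's actual formulation; the mask propagation component is omitted because the abstract does not specify it.

```python
import torch

def grpo_var_loss(logp, logp_old, r_final, r_inter, step_w,
                  lam=0.1, clip_eps=0.2):
    """Sketch of a GRPO objective for a VAR policy (illustrative only).

    logp, logp_old: (G, T) log-probs of the new / behavior policy for a
                    group of G samples over T VAR generation steps.
    r_final, r_inter: (G,) final and intermediate rewards per sample
                      (how the paper mixes them is not specified; we
                      assume a simple additive combination here).
    step_w: (T,) dynamic time-step weights (assumed schedule).
    """
    # Shape the reward with the stabilizing intermediate signal.
    r = r_final + lam * r_inter                              # (G,)
    # Group-relative advantage (GRPO): normalize within the group.
    adv = (r - r.mean()) / (r.std() + 1e-8)                  # (G,)
    # PPO-style clipped surrogate per sample and per time-step.
    ratio = torch.exp(logp - logp_old)                       # (G, T)
    surr = torch.minimum(
        ratio * adv[:, None],
        torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv[:, None],
    )
    # Dynamic time-step reweighting: rescale credit per VAR step.
    return -(surr * step_w[None, :]).mean()
```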
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective (2025)
- SuperFlow: Training Flow Matching Models with RL on the Fly (2025)
- Masked Auto-Regressive Variational Acceleration: Fast Inference Makes Practical Reinforcement Learning (2025)
- Accelerating Inference of Masked Image Generators via Reinforcement Learning (2025)
- WAM-Diff: A Masked Diffusion VLA Framework with MoE and Online Reinforcement Learning for Autonomous Driving (2025)
- MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation (2025)
- ESPO: Entropy Importance Sampling Policy Optimization (2025)
