
Awesome RL-VLA for Robotic Manipulation 🤖

[Paper]

A curated list of papers and resources on Reinforcement Learning of Vision-Language-Action (RL-VLA) models for Robotic Manipulation. This repository provides a comprehensive overview of training paradigms, methodologies, and state-of-the-art approaches in RL-VLA research.

📢 Latest News

🔥 [November 2025] Our comprehensive survey paper "A Survey on Reinforcement Learning of Vision-Language-Action Models for Robotic Manipulation" is now available on TechRxiv! Stay tuned for future updates.

🔍 Overview

RL training is crucial for enabling VLAs pre-trained on large-scale data to generalize out-of-distribution (OOD). Existing RL-VLA training paradigms can be categorized into three types based on how agents obtain and utilize feedback from the environment:

  • Online RL-VLA: Direct interaction with the environment during training
  • Offline RL-VLA: Learning from static datasets without further environmental interaction
  • Test-time RL-VLA: Models adapt their behavior during deployment without altering parameters

🚀 Training Paradigms

Offline RL-VLA

Offline RL trains VLA models on pre-collected static datasets, so policies can be learned without any further environment interaction. This makes the paradigm suitable for high-risk or resource-constrained deployment scenarios. A minimal sketch of a conservative offline update follows the list below.

Key Research Directions:

  • Data Utilization: Effective utilization of static datasets for policy improvement
  • Objective Modification: Customizing RL objectives for novel architectures and data augmentation
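
To make the offline objective concrete, here is a minimal sketch of one conservative Q-learning (CQL-style) update on a batch from a static dataset, in the spirit of the CQL-based entries in the tables further down. The interfaces here (`q_net`, `target_q_net.best_action`) are hypothetical assumptions for illustration, not any paper's actual API.

```python
import torch
import torch.nn.functional as F

def cql_update(q_net, target_q_net, optimizer, batch,
               num_action_samples=8, gamma=0.99, alpha=1.0):
    """One CQL-style step: TD loss plus a conservative penalty on OOD actions."""
    obs, action, reward, next_obs, done = batch  # tensors from a fixed dataset

    # TD target from a frozen target network; best_action is a hypothetical
    # helper returning the target policy's action for next_obs.
    with torch.no_grad():
        next_action = target_q_net.best_action(next_obs)
        target = reward + gamma * (1.0 - done) * target_q_net(next_obs, next_action)

    q_data = q_net(obs, action)                  # Q-values of in-dataset actions
    td_loss = F.mse_loss(q_data, target)

    # Conservative term: push Q down on sampled (likely OOD) actions and up on
    # dataset actions, keeping the learned policy close to the data support.
    B, K = obs.shape[0], num_action_samples
    rand_act = 2.0 * torch.rand(B * K, action.shape[-1]) - 1.0   # uniform [-1, 1]
    q_rand = q_net(obs.repeat_interleave(K, dim=0), rand_act).view(B, K)
    cql_penalty = (torch.logsumexp(q_rand, dim=1) - q_data).mean()

    loss = td_loss + alpha * cql_penalty
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```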

Online RL-VLA

Online RL-VLA enables interactive policy learning through continuous environment interaction, equipping pre-trained VLAs with adaptive closed-loop control for out-of-distribution real-world environments. A minimal policy-optimization sketch follows the list below.

Key Research Directions:

  • Policy Optimization: Direct policy improvement based on environmental rewards
  • Sample Efficiency: Learning effective policies with limited interaction budget
  • Active Exploration: Efficient exploration strategies for higher performance gains
  • Training Stability: Ensuring consistent policy updates and convergence
  • Infrastructure: Scalable frameworks for online RL-VLA training
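
As a concrete reference for the policy-optimization direction, here is a minimal PPO clipped-surrogate loss of the kind many entries in the online table report. The random tensors merely stand in for one rollout batch; nothing here is taken from a specific paper's implementation.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective over per-step (or per-token) log-probs."""
    ratio = torch.exp(logp_new - logp_old)                         # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                   # maximize surrogate

# Toy usage with placeholder tensors standing in for a rollout batch.
logp_new = torch.randn(64, requires_grad=True)
logp_old = logp_new.detach() + 0.1 * torch.randn(64)
advantages = torch.randn(64)
ppo_clip_loss(logp_new, logp_old, advantages).backward()
```

The clipping bounds how far each update can move the policy, which is one common answer to the training-stability concern listed above.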

Test-time RL-VLA

Test-time RL-VLA adapts model behavior during deployment through lightweight mechanisms that avoid the cost of fully fine-tuning the model in real-world scenarios. A minimal value-guidance sketch follows the list below.

Key Adaptation Mechanisms:

  • Value Guidance: Using pre-trained value functions to influence action selection
  • Memory Buffer Guidance: Retrieving relevant historical experiences during inference
  • Planning-guided Adaptation: Explicit reasoning over future action sequences
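
To illustrate value guidance concretely, here is a minimal sketch of re-ranking candidate actions from a frozen VLA with a pre-trained Q function at inference time. `vla.sample` and `q_fn` are hypothetical interfaces assumed for illustration, not a specific paper's API.

```python
import torch

@torch.no_grad()
def value_guided_action(vla, q_fn, obs, instruction, num_candidates=16):
    """Sample K candidate actions from a frozen VLA; execute the highest-value one."""
    candidates = torch.stack(
        [vla.sample(obs, instruction) for _ in range(num_candidates)]
    )                                                   # (K, action_dim)
    obs_rep = obs.unsqueeze(0).expand(num_candidates, -1)
    scores = q_fn(obs_rep, candidates)                  # (K,) state-action values
    return candidates[scores.argmax()]                  # no weights are updated
```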

📚 Paper Collection

Legend

  • Action: AR (Autoregressive), Diffusion, Flow (Flow-matching)
  • Reward: D (Dense Reward), S (Sparse Reward)
  • Model Type: MB (Model-based), MF (Model-free)
  • Environment: Sim. (Simulation), Real (Real-world)
  • Task: MT (Multi-task), ST (Single-task)

Offline RL-VLA

| Method | Date | Publication | Sim. | Real | Base VLA Model | Action | Reward | Algorithm | Type | Project |
|---|---|---|---|---|---|---|---|---|---|---|
| Q-Transformer | 2023.10 | CoRL23 🔗 | | | Transformer | AR | S | CQL | MF | 🔗 |
| PAC | 2024.02 | ICML24 🔗 | | | Perceiver-Actor-Critic | AR | S | AC | MF | 🔗 |
| GeRM (Quadruped Robot) | 2024.03 | IROS24 🔗 | | | Transformer-MoE | AR | S | CQL | MF | 🔗 |
| MoRE (Quadruped Robot) | 2025.03 | ICRA25 🔗 | | | MLLM-MoE | AR | S | CQL | MF | - |
| ReinboT | 2025.05 | ICML25 🔗 | | | ReinboT | AR | D | DT + RTG | MF | 🔗 |
| CO-RFT | 2025.08 | - | | | RoboVLMs | AR | D | Cal-QL + TD3 | MF | - |
| ARFM | 2025.09 | AAAI26 🔗 | | | π₀ | Flow | D | ARFM | MF | - |
| $π^*_{0.6}$ | 2025.11 | - | | | $π_{0.6}$ | Flow | D | RECAP | MF | 🔗 |
| NORA-1.5 | 2025.11 | - | | | NORA-1.5 | AR / Flow | D | DPO | MB | 🔗 |
| GigaBrain-0.5M* | 2026.02 | - | | | GigaBrain-0.5 | Flow | D | RAMP | MB | 🔗 |
| ARM | 2026.04 | - | | | GR00T N1.5 | Flow | D | AW-BC | MF | 🔗 |

Online RL-VLA

| Method | Date | Publication | Sim. | Real | Base VLA Model | Action | Reward | Algorithm | Type | Project |
|---|---|---|---|---|---|---|---|---|---|---|
| FLaRe | 2024.09 | ICRA25 🔗 | ✓ (ST) | ✓ (ST) | SPOC | AR | S | PPO | MF | 🔗 |
| PA-RL | 2024.12 | ICLR25 Workshop 🔗 | ✓ (ST) | ✓ (ST) | OpenVLA | AR | S | PA-RL | MF | 🔗 |
| RLDG | 2024.12 | RSS25 🔗 | | ✓ (ST) | OpenVLA / Octo | AR / Diffusion | S | RLPD | MF | 🔗 |
| iRe-VLA | 2025.01 | ICRA25 🔗 | ✓ (MT) | ✓ (MT) | iRe-VLA | AR | S | SACfD + SFT | MF | - |
| GRAPE | 2025.02 | ICRA25 Poster 🔗 | ✓ (MT) | ✓ (MT) | OpenVLA | AR | D | TPO | MF | 🔗 |
| SafeVLA | 2025.03 | NeurIPS25 Poster 🔗 | ✓ (ST) | | SPOC | AR | S | PPO | MF | 🔗 |
| RIPT-VLA | 2025.05 | - | ✓ (MT) | | QueST / OpenVLA-OFT | AR | S | LOOP | MF | 🔗 |
| VLA-RL | 2025.05 | - | ✓ (MT) | | OpenVLA | AR | D | PPO | MF | 🔗 |
| RLVLA | 2025.05 | NeurIPS25 Poster 🔗 | ✓ (MT) | | OpenVLA | AR | S | PPO / GRPO / DPO | MF | 🔗 |
| RFTF | 2025.05 | - | ✓ (MT) | | GR-MG / Seer | AR | D | PPO | MF | - |
| TGRPO | 2025.06 | - | ✓ (ST) | | OpenVLA | AR | D | GRPO | MF | - |
| RLRC | 2025.06 | - | ✓ (MT) | | OpenVLA | AR | S | PPO | MF | 🔗 |
| ThinkAct | 2025.07 | NeurIPS25 Poster 🔗 | ✓ (MT) | | MLLM + DiT | AR / Diffusion | D | GRPO (System 2) | MF | 🔗 |
| SimpleVLA-RL | 2025.09 | ICLR26 Poster 🔗 | ✓ (MT) | ✓ (ST) | OpenVLA-OFT | AR | S | GRPO | MF | 🔗 |
| Dual-Actor FT | 2025.09 | IROS25 Workshop Extended Abstract 🔗 | ✓ (MT) | ✓ (MT) | Octo / SmolVLA | Diffusion | S | QL + BC | MF | 🔗 |
| Generalist | 2025.09 | NeurIPS25 Poster 🔗 | ✓ (MT) | ✓ (MT) | PaLI 3B | AR | D | REINFORCE | MF | 🔗 |
| VLAC | 2025.09 | - | | ✓ (MT) | VLAC | AR | D | PPO | MF | 🔗 |
| AC PPO | 2025.09 | - | ✓ (ST) | | Octo-small | AR | S | PPO + BC | MF | - |
| VLA-RFT | 2025.10 | - | ✓ (MT) | | VLA-Adapter | Flow | D | GRPO | MB | 🔗 |
| RLinf-VLA | 2025.10 | - | ✓ (MT) | ✓ (MT) | OpenVLA / OpenVLA-OFT | AR | S | PPO / GRPO | MF | 🔗 |
| FPO | 2025.10 | - | ✓ (MT) | | π₀ | Flow | S | FPO | MF | - |
| ReSA | 2025.10 | - | ✓ (MT) | | OpenVLA | AR | D | PPO + SFT | MF | - |
| π_RL | 2025.10 | - | ✓ (MT) | | π₀ / π₀.₅ | Flow | S | PPO / GRPO | MF | 🔗 |
| PLD | 2025.10 | ICLR26 Poster 🔗 | ✓ (MT) | ✓ (MT) | OpenVLA / π₀ / Octo | AR / Flow | S | Cal-QL + SAC | MF | 🔗 |
| DeepThinkVLA | 2025.10 | - | ✓ (MT) | | π₀-Fast | AR | S | GRPO | MF | 🔗 |
| World-Env | 2025.11 | - | ✓ (ST) | ✓ (ST) | OpenVLA-OFT | AR | D | PPO | MB | 🔗 |
| RobustVLA | 2025.11 | - | ✓ (MT) | | OpenVLA-OFT | AR | D | PPO | MF | - |
| WMPO | 2025.11 | ICLR26 Poster 🔗 | ✓ (MT) | ✓ (MT) | OpenVLA-OFT | AR | S | GRPO | MB | 🔗 |
| ProphRL | 2025.11 | - | ✓ (ST) | ✓ (ST) | VLA-Adapter / π₀.₅ / OpenVLA-OFT (flow action) | Flow | S | FA-GRPO | MB | 🔗 |
| Robo-Dopamine | 2025.12 | CVPR26 🔗 | ✓ (MT) | ✓ (MT) | π₀.₅ | Flow | D | PPO | MF | 🔗 |
| EVOLVE-VLA | 2025.12 | - | ✓ (MT) | | OpenVLA-OFT | AR | D | GRPO | MB (VLAC) | 🔗 |
| SOP | 2026.01 | - | ✓ (MT) | | π₀.₅ | Flow | S | HG-DAgger / RECAP | MF | 🔗 |
| Green-VLA | 2026.01 | - | ✓ (MT) | ✓ (MT) | Green-VLA | Flow | S | IQL + actor-critic | MF | 🔗 |
| SA-VLA | 2026.01 | - | ✓ (MT) | | π₀.₅ | Flow | D | PPO | MF | 🔗 |
| World-Gymnast | 2026.02 | ICLR26 Workshop 🔗 | ✓ (MT) | ✓ (MT) | OpenVLA-OFT | AR | S | GRPO | MB | 🔗 |
| RL-VLA3 | 2026.02 | ICLR26 Workshop 🔗 | ✓ (MT) | | π₀ / π₀.₅ / GR00T N1.5 / OpenVLA-OFT | Flow / AR | S | - | MF | - |
| World-VLA-Loop | 2026.02 | - | ✓ (ST) | ✓ (ST) | OpenVLA-OFT | AR | S | GRPO | MB | 🔗 |
| RISE | 2026.02 | - | ✓ (ST) | | π₀.₅ | Flow | D | RISE | MB | 🔗 |
| WoVR | 2026.02 | - | ✓ (MT) | ✓ (MT) | OpenVLA-OFT | AR | S | GRPO | MB | 🔗 |
| ALOE | 2026.02 | - | ✓ (ST) | | π₀.₅ | Flow | S | AWR (Advantage-Weighted Regression) | MF | 🔗 |
| TwinRL-VLA | 2026.02 | - | ✓ (ST) | | Octo | Diffusion | S | Actor-Critic | MF | - |
| RL-Co | 2026.03 | - | ✓ (ST) | ✓ (ST) | OpenVLA / π₀.₅ | AR / Flow | D | ReinFlow / GRPO | MF | - |
| π_StepNFT | 2026.03 | - | ✓ (MT) | | π₀ / π₀.₅ | Flow | S | NFT | MF | 🔗 |
| ROBOMETER | 2026.03 | - | ✓ (MT) | | π₀ | Flow | D | DSRL | MF | 🔗 |
| AtomVLA | 2026.03 | - | ✓ (MT) | ✓ (ST) | AtomVLA | Flow | D | GRPO | MB | - |
| NS-VLA | 2026.03 | - | ✓ (MT) | | NS-VLA | AR | D | GRPO | MF | 🔗 |

Offline + Online RL-VLA

| Method | Date | Publication | Sim. | Real | Base VLA Model | Action | Reward | Algorithm | Type | Project |
|---|---|---|---|---|---|---|---|---|---|---|
| ConRFT | 2025.04 | RSS26 🔗 | | | Octo-small | Diffusion | S | Cal-QL + BC | MF | 🔗 |
| DiffusionRL-VLA | 2025.09 | - | | | π₀ | Flow | S | PPO (DP) + BC (VLA) | MF | - |
| SRPO | 2025.11 | - | | | OpenVLA* / π₀ / π₀-Fast | AR / Flow | D | SRPO | MF (MB-Reward but MF-RL) | 🔗 |
| DLR | 2025.11 | - | | | π₀ / OpenVLA | Flow / AR | S | PPO (MLP) + SFT (VLA) | MF | - |
| GR-RL | 2025.12 | - | | | GR-3 | Flow | S | TD3 / DSRL | MF | 🔗 |
| STARE-VLA | 2025.12 | - | | | OpenVLA / π₀.₅ | AR / Flow | D | PPO / TPO / SFT | MF | 🔗 |
| IG-RFT | 2026.02 | - | | | π₀.₅ | Flow | D | IG-AWR | MF | - |

Test-time RL-VLA

| Method | Date | Publication | Sim. | Real | Base VLA Model | Action | Reward | Algorithm | Type | Project |
|---|---|---|---|---|---|---|---|---|---|---|
| V-GPS | 2024.10 | CoRL25 🔗 | ✓ (MT) | ✓ (MT) | Octo / RT-1 / OpenVLA | AR / Diffusion | D | Cal-QL | MF | 🔗 |
| Hume | 2025.05 | - | ✓ (MT) | ✓ (MT) | Hume | Flow | S | Value Guidance | MF | 🔗 |
| DSRL | 2025.06 | CoRL25 🔗 | ✓ (MT) | ✓ (MT) | DP / π₀ | Diffusion / Flow | S | Diffusion Steering | MF | 🔗 |
| VLA-Reasoner | 2025.09 | ICRA26 🔗 | ✓ (ST) | ✓ (ST) | OpenVLA / SpatialVLA / π₀-Fast | AR / Diffusion | D | MCTS | MB | 🔗 |
| VLAPS | 2025.11 | CoRL25 Workshop 🔗 | ✓ (ST) | | Octo | Diffusion | S | MCTS | MB | 🔗 |
| VLA-Pilot | 2025.11 | - | | ✓ (ST) | DiVLA / RDT | AR / Diffusion | D | Value Guidance | TMB (MLLM) | 🔗 |
| TACO | 2025.12 | - | ✓ (ST) | | π₀ / OpenVLA et al. | Flow | S | CNF estimation | MF | 🔗 |
| TT-VLA | 2026.01 | - | ✓ (ST) | ✓ (ST) | Nora / OpenVLA / TraceVLA | AR | D | PPO (Value-free) | MF | - |
| VLS | 2026.02 | - | ✓ (MT) | ✓ (MT) | OpenVLA / π₀ / π₀.₅ | Flow | D | gradient-based steering | MB (VLM) | 🔗 |

Note: The 🔗 symbol in the Project column indicates papers with available project pages, GitHub repositories, or demo websites.

🔗 Useful Resources

🎯 RL-VLA Action Optimization

Different VLA architectures require distinct RL optimization strategies based on their action generation mechanisms; a minimal code contrast follows the list below:

  • 🔤 Autoregressive VLA: Optimizes actions at the token level. Each action token is individually optimized through RL, enabling fine-grained control over action sequences but requiring careful handling of sequential dependencies.

  • 🌊 Generative VLA (Diffusion/Flow): Optimizes the action generation process at the sequence level. The entire action trajectory is optimized as a cohesive unit through the denoising or flow-matching process, providing holistic action optimization.

  • 🔗 Dual-system VLA: Optimizes at the bridge level. RL decides which high-level action proposal to pass to the fast controller, creating a hierarchical optimization approach that complements both token-level and sequence-level methods.
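
To make the token-level vs. sequence-level distinction concrete, here is a minimal, illustrative contrast. The tensor shapes and the scalar trajectory advantage (e.g., a GRPO-style group-normalized return) are assumptions for illustration, not any specific paper's formulation.

```python
import torch

def token_level_loss(token_log_probs, advantage):
    """Autoregressive VLA: every action token shares the trajectory advantage,
    so the gradient reaches each token's conditional distribution."""
    return -(advantage * token_log_probs).sum()         # token_log_probs: (T,)

def sequence_level_loss(chunk_log_prob, advantage):
    """Generative (diffusion/flow) VLA: one (approximate) log-likelihood is
    assigned to the whole action chunk, so credit is per sequence."""
    return -(advantage * chunk_log_prob)                # chunk_log_prob: scalar
```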

Base VLA Models

  • GR00T-N1 - NVIDIA series
  • π0 - PI series
  • OpenVLA - Open-source VLA model
  • Octo - Generalist robot policy
  • RT-1 - Robotics Transformer

Datasets & Benchmarks

  • Open X-Embodiment - Large-scale robotic datasets
  • LIBERO - Benchmark for lifelong robot learning
  • SimplerEnv - Benchmark for real-to-sim robot policy evaluation
  • RoboTwin - Benchmark for bimanual robot learning
  • DeepPHY - Benchmark for physical reasoning

Frameworks & Tools

  • RLinf - Infrastructure for online RL fine-tuning of VLAs
  • RLinf v0.2 - Infrastructure for real-world RL

🤝 Contributing

We welcome contributions to this awesome list! Please feel free to:

  1. Add new papers: Submit a PR with new RL-VLA papers following the existing format
  2. Update information: Correct any errors or update paper information
  3. Suggest improvements: Propose better organization or additional sections

Contribution Guidelines

  • Ensure papers are relevant to RL-VLA research
  • Include paper links, project pages (if available), and key details
  • Follow the existing table format for consistency
  • Add a brief description for new paradigms or significant methodological contributions

📄 Citation

If you find this repository useful, please consider citing:

@article{pine2025rlvla,
  title={A Survey on Reinforcement Learning of Vision-Language-Action Models for Robotic Manipulation},
  author={Haoyuan Deng and Zhenyu Wu and Haichao Liu and Wenkai Guo and Yuquan Xue and Ziyu Shan and Chuanrui Zhang and Bofang Jia and Yuan Ling and Guanxing Lu and Ziwei Wang},
  journal={TechRxiv},
  year={2025},
  doi={10.36227/techrxiv.176531955.54563920/v1},
  note={Preprint}
}

⭐ Star History

Star this repository if you find it helpful!

