Reinforcement Learning Papers

PRs Welcome

Papers related to Reinforcement Learning (we mainly focus on the single-agent setting).

Since a very large number of new reinforcement learning papers appear at conferences every year, we can only list those we have read and consider insightful.

We have added some ICLR22, ICML22, NeurIPS22, ICLR23, ICML23, NeurIPS23, ICLR24, ICML24, NeurIPS24, ICLR25, ICML25, NeurIPS25, ICLR26 papers on RL.

Contents

Model Free (Online) RL

Classic Methods

| Title | Method | Conference | on/off policy | Action Space | Policy | Description |
| --- | --- | --- | --- | --- | --- | --- |
| Human-level control through deep reinforcement learning, [other link] | DQN | Nature15 | off | Discrete | based on value function | use a deep neural network to train Q-learning and reach human-level performance in Atari games; two main tricks: a replay buffer for improving sample efficiency, and decoupling the target network from the behavior network |
| Deep reinforcement learning with double q-learning | Double DQN | AAAI16 | off | Discrete | based on value function | find that the Q function in DQN may overestimate; decouple action selection from Q-value evaluation by using two neural networks |
| Dueling network architectures for deep reinforcement learning | Dueling DQN | ICML16 | off | Discrete | based on value function | use a single network with two streams that estimate the state-value function and the advantage function, which are combined to produce the Q function |
| Prioritized Experience Replay | Priority Sampling | ICLR16 | off | Discrete | based on value function | give different sampling weights to the transitions in the replay buffer (e.g., based on TD error) |
| Rainbow: Combining Improvements in Deep Reinforcement Learning | Rainbow | AAAI18 | off | Discrete | based on value function | combine different improvements to DQN: Double DQN, Dueling DQN, Priority Sampling, Multi-step learning, Distributional RL, Noisy Nets |
| Policy Gradient Methods for Reinforcement Learning with Function Approximation | PG | NeurIPS99 | on/off | Continuous or Discrete | function approximation | propose the Policy Gradient Theorem: how to calculate the gradient of the expected cumulative return with respect to the policy parameters |
| ---- | AC/A2C | ---- | on/off | Continuous or Discrete | parameterized neural network | AC: replace the return in PG with a Q-function approximator to reduce variance; A2C: replace the Q function in AC with the advantage function to reduce variance |
| Asynchronous Methods for Deep Reinforcement Learning | A3C | ICML16 | on/off | Continuous or Discrete | parameterized neural network | propose three tricks to improve performance: (i) use multiple agents to interact with the environment; (ii) share network parameters between the value function and the policy; (iii) modify the loss function (MSE of value function + PG loss + policy entropy) |
| Trust Region Policy Optimization | TRPO | ICML15 | on | Continuous or Discrete | parameterized neural network | introduce a trust region to policy optimization for guaranteed monotonic improvement |
| Proximal Policy Optimization Algorithms | PPO | arxiv17 | on | Continuous or Discrete | parameterized neural network | replace the hard constraint of TRPO with a surrogate objective that clips the probability ratio |
| Deterministic Policy Gradient Algorithms | DPG | ICML14 | off | Continuous | function approximation | consider a deterministic policy for continuous action spaces and prove the Deterministic Policy Gradient Theorem; use a stochastic behaviour policy to encourage exploration |
| Continuous Control with Deep Reinforcement Learning | DDPG | ICLR16 | off | Continuous | parameterized neural network | adapt the ideas of DQN to DPG: (i) deep neural network function approximators, (ii) a replay buffer, (iii) slowly updated target networks for computing the target Q value |
| Addressing Function Approximation Error in Actor-Critic Methods | TD3 | ICML18 | off | Continuous | parameterized neural network | adapt the ideas of Double DQN to DDPG: take the minimum value between a pair of critics to limit overestimation |
| Reinforcement Learning with Deep Energy-Based Policies | SQL | ICML17 | off | mainly Continuous | parameterized neural network | consider max-entropy RL and propose soft Q iteration as well as soft Q-learning |
| Soft Actor-Critic Algorithms and Applications, Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, [appendix] | SAC | ICML18 | off | mainly Continuous | parameterized neural network | build on the theoretical analysis of SQL and extend soft Q iteration (soft Q evaluation + soft Q improvement); reparameterize the policy and use two parameterized value functions; propose SAC |
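
Several descriptions in the table above differ only in how a bootstrapped target or surrogate term is computed. The following is a minimal sketch of those differences in plain Python, not the reference implementation of any listed paper. The function names, the scalar/list inputs, and the default values of `gamma` and `clip_eps` are assumptions made for illustration.

```python
def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """DQN-style target: the target network both selects and evaluates the action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN-style target: the online network selects the action,
    the target network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


def clipped_double_q_target(reward, next_q1, next_q2, gamma=0.99, done=False):
    """TD3-style target: take the minimum over a pair of critic estimates."""
    if done:
        return reward
    return reward + gamma * min(next_q1, next_q2)


def ppo_clip_term(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate for one sample.

    `ratio` is the probability ratio pi_new(a|s) / pi_old(a|s).
    """
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)
```

In this sketch, the Double DQN target can be lower than the DQN target whenever the two networks disagree on the best action, which is how the decoupling limits overestimation. The clipped surrogate removes the incentive to push the ratio beyond `1 + clip_eps` when the advantage is positive.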

Exploration

| Title | Method | Conference | Description |
| --- | --- | --- | --- |
| Curiosity-driven Exploration by Self-supervised Prediction | ICM | ICML17 | propose that curiosity can serve as an intrinsic reward signal to enable the agent to explore its environment and learn skills when rewards are sparse; formulate curiosity as the error in an agent's ability to predict the consequence of its own actions in a visual feature space learned by a self-supervised inverse dynamics model |
| Curiosity-Driven Exploration via Latent Bayesian Surprise | LBS | AAAI22 | apply Bayesian surprise in a latent space representing the agent's current understanding of the dynamics of the system |
| Automatic Intrinsic Reward Shaping for Exploration in Deep Reinforcement Learning | AIRS | ICML23 | select the shaping function from a predefined set based on the estimated task return in real time, providing reliable exploration incentives and alleviating the biased objective problem; develop a toolkit that provides high-quality implementations of various intrinsic reward modules based on PyTorch |
| Curiosity in Hindsight: Intrinsic Exploration in Stochastic Environments | Curiosity in Hindsight | ICML23 | consider exploration in stochastic environments; learn representations of the future that capture precisely the unpredictable aspects of each outcome and use them as additional input for predictions, such that intrinsic rewards only reflect the predictable aspects of world dynamics |
| Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration | | NeurIPS23 spotlight | |
| MIMEx: Intrinsic Rewards from Masked Input Modeling | MIMEx | NeurIPS23 | propose that the mask distribution can be flexibly tuned to control the difficulty of the underlying conditional prediction task |
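
The ICM entry above formulates curiosity as a prediction error in a learned feature space. As a rough sketch of that idea, the intrinsic reward can be computed as a scaled squared error between the predicted and the actual next-state features. The function name, the plain-list feature vectors, and the scaling factor `eta` are assumptions for illustration; the feature encoder and forward model that would produce these vectors are not shown.

```python
def curiosity_reward(predicted_next_features, actual_next_features, eta=1.0):
    """Intrinsic reward as the scaled squared prediction error in feature space.

    Both arguments are equal-length sequences of floats. In a real agent they
    would come from a learned forward model and a learned feature encoder.
    """
    if len(predicted_next_features) != len(actual_next_features):
        raise ValueError("feature vectors must have the same length")
    squared_error = sum(
        (p - a) ** 2
        for p, a in zip(predicted_next_features, actual_next_features)
    )
    return 0.5 * eta * squared_error


def total_reward(extrinsic_reward, intrinsic_reward):
    """Combine the (possibly sparse) task reward with the curiosity bonus."""
    return extrinsic_reward + intrinsic_reward
```

Transitions that the forward model already predicts well yield a bonus close to zero, so the agent is pushed toward transitions it cannot yet predict.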

Representation Learning

Note: representation learning with MBRL is in the part World Models

| Title | Method | Conference | Description |
| --- | --- | --- | --- |
| CURL: Contrastive Unsupervised Representations for Reinforcement Learning | CURL | ICML20 | extract high-level features from raw pixels using contrastive learning and perform off-policy control on top of the extracted features |
| Learning Invariant Representations for Reinforcement Learning without Reconstruction | DBC | ICLR21 | propose using bisimulation to learn robust latent representations which encode only the task-relevant information from observations |
| Reinforcement Learning with Prototypical Representations | Proto-RL | ICML21 | pre-train task-agnostic representations and prototypes on environments without downstream task information |
| Understanding the World Through Action | ---- | CoRL21 | discuss how self-supervised reinforcement learning combined with offline RL can enable scalable representation learning |
| Flow-based Recurrent Belief State Learning for POMDPs | FORBES | ICML22 | incorporate normalizing flows into variational inference to learn general continuous belief states for POMDPs |
| Contrastive Learning as Goal-Conditioned Reinforcement Learning | Contrastive RL | NeurIPS22 | show that (contrastive) representation learning methods can be cast as RL algorithms in their own right |
| Does Self-supervised Learning Really Improve Reinforcement Learning from Pixels? | ---- | NeurIPS22 | conduct an extensive comparison of various self-supervised losses under the existing joint learning framework for pixel-based reinforcement learning in many environments from different benchmarks, including one real-world environment |
| Reinforcement Learning with Automated Auxiliary Loss Search | A2LS | NeurIPS22 | propose to automatically search for top-performing auxiliary loss functions for learning better representations in RL; define a general auxiliary loss space of size 7.5 × 10^20 based on the collected trajectory data and explore the space with an efficient evolutionary search strategy |
| Mask-based Latent Reconstruction for Reinforcement Learning | MLR | NeurIPS22 | propose an effective self-supervised method to predict complete state representations in the latent space from observations with spatially and temporally masked pixels |
| Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training | VIP | ICLR23 Spotlight | cast representation learning from human videos as an offline goal-conditioned reinforcement learning problem; derive a self-supervised dual goal-conditioned value-function objective that does not depend on actions, enabling pre-training on unlabeled human videos |
| Latent Variable Representation for Reinforcement Learning | ---- | ICLR23 | provide a representation view of latent variable models for state-action value functions, which allows both a tractable variational learning algorithm and an effective implementation of the optimism/pessimism principle in the face of uncertainty for exploration |
| Spectral Decomposition Representation for Reinforcement Learning | | ICLR23 | |
| Become a Proficient Player with Limited Data through Watching Pure Videos | FICC | ICLR23 | consider the setting where the pre-training data are action-free videos; introduce a two-phase training pipeline; pre-training phase: implicitly extract the hidden action embedding from videos and pre-train the visual representation and the environment dynamics network based on vector quantization; downstream tasks: fine-tune with a small amount of task data based on the learned models |
| Bootstrapped Representations in Reinforcement Learning | ---- | ICML23 | provide a theoretical characterization of the state representation learnt by temporal difference learning; find that this representation differs from the features learned by Monte Carlo and residual gradient algorithms for most transition structures of the environment in the policy evaluation setting |
| Representation-Driven Reinforcement Learning | RepRL | ICML23 | reduce the policy search problem to a contextual bandit problem, using a mapping from policy space to a linear feature space |
| Conditional Mutual Information for Disentangled Representations in Reinforcement Learning | CMID | NeurIPS23 spotlight | propose an auxiliary task for RL algorithms that learns a disentangled representation of high-dimensional observations with correlated features by minimising the conditional mutual information between features in the representation |
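
Several entries above (CURL, Contrastive RL) rely on a contrastive objective. The sketch below shows a generic InfoNCE-style loss in plain Python, where the positive key for `queries[i]` is `keys[i]` and all other keys act as negatives. It assumes a plain dot-product similarity and a `temperature` parameter chosen for illustration; the listed papers may use a different similarity function and encoder setup, so this is not their exact loss.

```python
import math


def dot(u, v):
    """Dot product of two equal-length sequences of floats."""
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(queries, keys, temperature=1.0):
    """Average InfoNCE loss over a batch of embedding vectors.

    For each query, the loss is the cross-entropy of picking its own key
    (same index) among all keys in the batch.
    """
    if len(queries) != len(keys) or not queries:
        raise ValueError("queries and keys must be non-empty and equally long")
    total = 0.0
    for i, query in enumerate(queries):
        logits = [dot(query, key) / temperature for key in keys]
        # Numerically stable log-sum-exp over all keys.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        total += log_sum_exp - logits[i]
    return total / len(queries)
```

The loss is small when each query is more similar to its own key than to the other keys in the batch, which is the property the representation is trained to have.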

Unsupervised Learning

| Title | Method | Conference | Description |
| --- | --- | --- | --- |
| Variational Intrinsic Control | ---- | arXiv1611 | introduce a new unsupervised reinforcement learning method for discovering the set of intrinsic options available to an agent, which is learned by maximizing the number of different states an agent can reliably reach, as measured by the mutual information between the set of options and option termination states |
| Diversity is All You Need: Learning Skills without a Reward Function | DIAYN | ICLR19 | learn diverse skills in environments without any rewards by maximizing an information-theoretic objective |
| Unsupervised Control Through Non-Parametric Discriminative Rewards | | ICLR19 | |
| Dynamics-Aware Unsupervised Discovery of Skills | DADS | ICLR20 | propose to learn low-level skills using model-free RL with the explicit aim of making model-based control easy |
| Fast task inference with variational intrinsic successor features | VISR | ICLR20 | |
| Decoupling representation learning from reinforcement learning | ATC | ICML21 | propose a new unsupervised task tailored to reinforcement learning named Augmented Temporal Contrast (ATC), which borrows ideas from contrastive learning; benchmark several leading unsupervised learning algorithms by pre-training encoders on expert demonstrations and using them in RL agents |
| Unsupervised Skill Discovery with Bottleneck Option Learning | IBOL | ICML21 | propose a novel skill discovery method with an information bottleneck, which provides multiple benefits including learning skills in a more disentangled and interpretable way with respect to skill latents and being robust to nuisance information |
| APS: Active Pretraining with Successor Features | APS | ICML21 | address the issues of APT and VISR by combining them in a novel way |
| Behavior From the Void: Unsupervised Active Pre-Training | APT | NeurIPS21 | propose a non-parametric entropy computed in an abstract representation space; compute the average of the Euclidean distance of each particle to its nearest neighbors for a set of samples |
| Pretraining representations for data-efficient reinforcement learning | SGI | NeurIPS21 | pre-train with unlabeled data and fine-tune on a small amount of task-specific data to improve the data efficiency of RL; employ a combination of latent dynamics modelling and unsupervised goal-conditioned RL |
| URLB: Unsupervised Reinforcement Learning Benchmark | URLB | NeurIPS21 | a benchmark for unsupervised reinforcement learning |
| Discovering and Achieving Goals via World Models | LEXA | NeurIPS21 | train both an explorer and an achiever policy without supervision via imagined rollouts in world models; after the unsupervised phase, solve tasks specified as goal images zero-shot without any additional learning |
| The Information Geometry of Unsupervised Reinforcement Learning | ---- | ICLR22 oral | show that unsupervised skill discovery algorithms based on mutual information maximization do not learn skills that are optimal for every possible reward function; provide a geometric perspective on some skill learning methods |
| Lipschitz Constrained Unsupervised Skill Discovery | LSD | ICLR22 | argue that MI-based skill discovery methods can easily maximize the MI objective with only slight differences in the state space; propose a novel objective based on a Lipschitz-constrained state representation function so that maximizing the objective in the latent space always entails an increase in traveled distances (or variations) in the state space |
| Learning more skills through optimistic exploration | DISDAIN | ICLR22 | derive an information gain auxiliary objective that involves training an ensemble of discriminators and rewarding the policy for their disagreement; the objective directly estimates the epistemic uncertainty that comes from the discriminator not having seen enough training examples |
| A Mixture of Surprises for Unsupervised Reinforcement Learning | MOSS | NeurIPS22 | train one mixture component whose objective is to maximize the surprise and another whose objective is to minimize the surprise, in order to handle the setting where the entropy of the environment's dynamics may be unknown |
| Unsupervised Reinforcement Learning with Contrastive Intrinsic Control | CIC | NeurIPS22 | propose to maximize the mutual information between state transitions and latent skill vectors |
| Unsupervised Skill Discovery via Recurrent Skill Training | ReST | NeurIPS22 | encourage later-trained skills to avoid entering the states already covered by previous skills |
| Choreographer: Learning and Adapting Skills in Imagination | Choreographer | ICLR23 Spotlight | decouple the exploration and skill learning processes; use a meta-controller to evaluate and adapt the learned skills efficiently by deploying them in parallel in imagination |
| Provable Unsupervised Data Sharing for Offline Reinforcement Learning | | ICLR23 | |
| Discovering Policies with DOMiNO: Diversity Optimization Maintaining Near Optimality | | ICLR23 | |
| Mastering the Unsupervised Reinforcement Learning Benchmark from Pixels | Dyna-MPC | ICML23 oral | utilize unsupervised model-based RL for pre-training the agent; fine-tune on downstream tasks via a task-aware fine-tuning strategy combined with a hybrid planner, Dyna-MPC |
| On the Importance of Feature Decorrelation for Unsupervised Representation Learning in Reinforcement Learning | SimTPR | ICML23 | propose a novel URL framework that causally predicts future states while increasing the dimension of the latent manifold by decorrelating the features in the latent space |
| CLUTR: Curriculum Learning via Unsupervised Task Representation Learning | | ICML23 | |
| Controllability-Aware Unsupervised Skill Discovery | CSD | ICML23 | train a controllability-aware distance function based on the current skill repertoire and combine it with distance-maximizing skill discovery |
| Behavior Contrastive Learning for Unsupervised Skill Discovery | BeCL | ICML23 | propose a novel unsupervised skill discovery method through contrastive learning among behaviors, which makes the agent produce similar behaviors for the same skill and diverse behaviors for different skills |
| Variational Curriculum Reinforcement Learning for Unsupervised Discovery of Skills | | ICML23 | |
| Learning to Discover Skills through Guidance | DISCO-DANCE | NeurIPS23 | select the guide skill that has the highest potential to reach unexplored states; guide other skills to follow the guide skill; disperse the guided skills to maximize their discriminability in unexplored states |
| Creating Multi-Level Skill Hierarchies in Reinforcement Learning | | NeurIPS23 | |
| Unsupervised Behavior Extraction via Random Intent Priors | | NeurIPS23 | |
| METRA: Scalable Unsupervised RL with Metric-Aware Abstraction | METRA | ICLR24 oral | |
| Language Guided Skill Discovery | LGSD | arXiv2406 | take user prompts as input and output a set of semantically distinctive skills |
| PEAC: Unsupervised Pre-training for Cross-Embodiment Reinforcement Learning | CEURL, PEAC | NeurIPS24 | consider unsupervised pre-training across a distribution of embodiments, namely CEURL; propose PEAC for handling CEURL |
| Exploratory Diffusion Model for Unsupervised Reinforcement Learning | ExDM | ICLR26 oral | utilize diffusion models for boosting unsupervised exploration and for fine-tuning pre-trained diffusion policies |
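
The APT entry above describes a particle-based quantity: the average Euclidean distance from each sample to its nearest neighbors. A minimal sketch of turning that quantity into a per-sample exploration bonus follows. The function name, the choice of `k`, and the `log(c + ...)` squashing with constant `c` are assumptions for illustration; the table only states the averaged nearest-neighbor distance itself.

```python
import math


def knn_particle_bonus(particles, k=3, c=1.0):
    """Per-particle bonus from the mean distance to the k nearest neighbors.

    `particles` is a list of equal-length coordinate sequences (for example,
    state embeddings from a batch). Isolated particles get a larger bonus.
    """
    if len(particles) < 2:
        raise ValueError("need at least two particles")
    bonuses = []
    for i, particle in enumerate(particles):
        # Distances to every other particle, smallest first.
        distances = sorted(
            math.dist(particle, other)
            for j, other in enumerate(particles)
            if j != i
        )
        nearest = distances[:k]
        mean_distance = sum(nearest) / len(nearest)
        # Assumed squashing: keeps the bonus finite when neighbors coincide.
        bonuses.append(math.log(c + mean_distance))
    return bonuses
```

For example, with particles `[[0, 0], [1, 0], [10, 0]]` and `k=1`, the third particle is far from the others and receives the largest bonus, which is the behavior an entropy-maximizing explorer wants to reward.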

Current methods

TitleMethodConferenceDescription
Weighted importance sampling for off-policy learning with linear function approximationWIS-LSTDNeurIPS14
Importance Sampling Policy Evaluation with an Estimated Behavior PolicyRISICML19
Provably efficient RL with Rich Observations via Latent State DecodingBlock MDPICML19
Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO----ICLR20show that the improvement of performance is related to code-level optimizations
Reinforcement Learning with Augmented DataRADNeurIPS20propose first extensive study of general data augmentations for RL on both pixel-based and state-based inputs
Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from PixelsDrQICLR21 Spotlightpropose to regularize the value function when applying data augmentation with model-free methods and reach state-of-the-art performance in image-pixels tasks
What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study----ICLR21do a large scale empirical study to evaluate different tricks for on-policy algorithms on MuJoCo
Mirror Descent Policy OptimizationMDPOICLR21
Learning Invariant Representations for Reinforcement Learning without ReconstructionDBCICLR21
Randomized Ensemble Double Q-Learning: Learning Fast Without a ModelREDQICLR21consider three ingredients: (i) update q functions many times at every epoch; (ii) use an ensemble of Q functions; (iii) use the minimization across a random subset of Q functions from the ensemble for avoiding the overestimation; propose REDQ and achieve similar performance with model-based methods
Deep Reinforcement Learning at the Edge of the Statistical Precipice----NeurIPS21 oustandstanding paperadvocate for reporting interval estimates of aggregate performance and propose performance profiles to account for the variability in results, as well as present more robust and efficient aggregate metrics, such as interquartile mean scores, to achieve small uncertainty in results; [rliable]
Generalizable Episodic Memory for Deep Reinforcement LearningGEMICML21propose to integrate the generalization ability of neural networks and the fast retrieval manner of episodic memory
A Max-Min Entropy Framework for Reinforcement LearningMMENeurIPS21find that SAC may fail in explore states with low entropy (arrive states with high entropy and increase their entropies); propose a max-min entropy framework to address this issue
Maximum Entropy RL (Provably) Solves Some Robust RL Problems ----ICLR22theoretically prove that
SO(2)-Equivariant Reinforcement LearningEqui DQN, Equi SACICLR22 Spotlightconsider to learn transformation-invariant policies and value functions; define and analyze group equivariant MDPs
CoBERL: Contrastive BERT for Reinforcement LearningCoBERLICLR22 Spotlightpropose Contrastive BERT for RL (COBERL) that combines a new contrastive loss and a hybrid LSTM-transformer architecture to tackle the challenge of improving data efficiency
Understanding and Preventing Capacity Loss in Reinforcement LearningInFeRICLR22 Spotlightpropose that deep RL agents lose some of their capacity to quickly fit new prediction tasks during training; propose InFeR to regularize a set of network outputs towards their initial values
On Lottery Tickets and Minimal Task Representations in Deep Reinforcement Learning----ICLR22 Spotlightconsider lottery ticket hypothesis in deep reinforcement learning
Reinforcement Learning with Sparse Rewards using Guidance from Offline DemonstrationLOGOICLR22 Spotlightconsider the sparse reward challenges in RL; propose LOGO that exploits the offline demonstration data generated by a sub-optimal behavior policy; each step of LOGO contains a policy improvement step via TRPO and an additional policy guidance step by using the sub-optimal behavior policy
Sample Efficient Deep Reinforcement Learning via Uncertainty EstimationIV-RLICLR22 Spotlightanalyze the sources of uncertainty in the supervision of modelfree DRL algorithms, and show that the variance of the supervision noise can be estimated with negative log-likelihood and variance ensembles
Generative Planning for Temporally Coordinated Exploration in Reinforcement LearningGPMICLR22 Spotlightfocus on generating consistent actions for model-free RL, and borrow ideas from Model-based planning and action-repeat; use the policy to generate multi-step actions
When should agents explore?----ICLR22 Spotlightconsider when to explore and propose to choose a heterogeneous mode-switching behavior policy
Maximizing Ensemble Diversity in Deep Reinforcement LearningMED-RLICLR22
Maximum Entropy RL (Provably) Solves Some Robust RL Problems ----ICLR22theoretically prove that standard maximum entropy RL is robust to some disturbances in the dynamics and the reward function
Learning Generalizable Representations for Reinforcement Learning via Adaptive Meta-learner of Behavioral SimilaritiesAMBSICLR22
Large Batch Experience ReplayLaBERICML22 oralcast the replay buffer sampling problem as an importance sampling one for estimating the gradient and derive the theoretically optimal sampling distribution
Do Differentiable Simulators Give Better Gradients for Policy Optimization?----ICML22 oralconsider whether differentiable simulators give better policy gradients; show some pitfalls of First-order estimates and propose alpha-order estimates
Federated Reinforcement Learning: Communication-Efficient Algorithms and Convergence AnalysisICML22 oral
An Analytical Update Rule for General Policy Optimization----ICML22 oralprovide a tighter bound for truse-region methods
Generalised Policy Improvement with Geometric Policy CompositionGSPsICML22 oralpropose the concept of geometric switching policy (GSP), i.e., we have a set of policies and will use them to take action in turn, for each policy, we sample a number from the geometric distribution and take this policy such number of steps; consider policy improvement over nonMarkov GSPs
Why Should I Trust You, Bellman? The Bellman Error is a Poor Replacement for Value Error----ICML22aim to better understand the relationship between the Bellman error and the accuracy of value functions through theoretical analysis and empirical study; point out that the Bellman error is a poor replacement for value error, including (i) The magnitude of the Bellman error hides bias, (ii) Missing transitions breaks the Bellman equation
Adaptive Model Design for Markov Decision Process----ICML22consider Regularized Markov Decision Process and formulate it as a bi-level problem
Stabilizing Off-Policy Deep Reinforcement Learning from PixelsA-LIXICML22propose that temporal-difference learning with a convolutional encoder and lowmagnitude reward will cause instabilities, which is named catastrophic self-overfitting; propose to provide adaptive regularization to the encoder’s gradients that explicitly prevents the occurrence of catastrophic self-overfitting
Understanding Policy Gradient Algorithms: A Sensitivity-Based Approach----ICML22study PG from a perturbation perspective
Mirror Learning: A Unifying Framework of Policy OptimisationMirror LearningICML22propose a novel unified theoretical framework named Mirror Learning to provide theoretical guarantees for General Policy Improvement (GPI) and Trust-Region Learning (TRL); propose an interesting, graph-theoretical perspective on mirror learning
Continuous Control with Action Quantization from DemonstrationsAQuaDemICML22leverag the prior of human demonstrations for reducing a continuous action space to a discrete set of meaningful actions; point out that using a set of actions rather than a single one (Behavioral Cloning) enables to capture the multimodality of behaviors in the demonstrations
Off-Policy Fitted Q-Evaluation with Differentiable Function Approximators: Z-Estimation and Inference Theory----ICML22analyze Fitted Q Evaluation (FQE) with general differentiable function approximators, including neural function approximations by using the Z-estimation theory
The Primacy Bias in Deep Reinforcement Learningprimacy biasICML22find that deep RL agents incur a risk of overfitting to earlier experiences, which will negatively affect the rest of the learning process; propose a simple yet generally-applicable mechanism that tackles the primacy bias by periodically resetting a part of the agent
Optimizing Sequential Experimental Design with Deep Reinforcement LearningICML22use DRL for solving the optimal design of sequential experiments
The Geometry of Robust Value FunctionsICML22study the geometry of the robust value space for the more general Robust MDPs
Utility Theory for Markovian Sequential Decision MakingAffine-Reward MDPsICML22extend von Neumann-Morgenstern (VNM) utility theorem to decision making setting
Reducing Variance in Temporal-Difference Value Estimation via Ensemble of Deep NetworksMeanQICML22consider variance reduction in Temporal-Difference Value Estimation; propose MeanQ to estimate target values by ensembling
Unifying Approximate Gradient Updates for Policy OptimizationICML22
Reinforcement Learning with Neural Radiance FieldsNeRF-RLNeurIPS22propose to train an encoder that maps multiple image observations to a latent space describing the objects in the scene
On Reinforcement Learning and Distribution Matching for Fine-Tuning Language Models with no Catastrophic Forgetting----NeurIPS22explore the theoretical connections between Reward Maximization (RM) and Distribution Matching (DM)
Faster Deep Reinforcement Learning with Slower Online NetworkDQN Pro, Rainbow ProNeurIPS22incentivize the online network to remain in the proximity of the target network
Reincarnating Reinforcement Learning: Reusing Prior Computation to Accelerate ProgressPVRLNeurIPS22focus on reincarnating RL from any agent to any other agent; present reincarnating RL as an alternative workflow or class of problem settings, where prior computational work (e.g., learned policies) is reused or transferred between design iterations of an RL agent, or from one RL agent to another
Sample-Efficient Reinforcement Learning by Breaking the Replay Ratio BarrierSR-SAC, SR-SPRICLR23 oralshow that fully or partially resetting the parameters of deep reinforcement learning agents causes better replay ratio scaling capabilities to emerge
Guarded Policy Optimization with Imperfect Online DemonstrationsTS2CICLR23 Spotlighth incorporate teacher intervention based on trajectory-based value estimation
Towards Interpretable Deep Reinforcement Learning with Human-Friendly PrototypesPW-NetICLR23 Spotlightfocus on making an “interpretable-by-design” deep reinforcement learning agent which is forced to use human-friendly prototypes in its decisions for making its reasoning process clear; train a “wrapper” model called PW-Net that can be added to any pre-trained agent, which allows them to be interpretable
DEP-RL: Embodied Exploration for Reinforcement Learning in Overactuated and Musculoskeletal SystemsDEP-RLICLR23 Spotlightidentify the DEP controller, known from the field of self-organizing behavior, to generate more effective exploration than other commonly used noise processes; first control the 7 degrees of freedom (DoF) human arm model with RL on a muscle stimulation level
Efficient Deep Reinforcement Learning Requires Regulating Statistical OverfittingAVTDICLR23propose a simple active model selection method (AVTD) that attempts to automatically select regularization schemes by hill-climbing on validation TD error
Greedy Actor-Critic: A New Conditional Cross-Entropy Method for Policy ImprovementCCEM, GreedyACICLR23propose to iteratively take the top percentile of actions, ranked according to the learned action-values; leverage theory for CEM to validate that CCEM concentrates on maximally valued actions across states over time
Reward Design with Language Models----ICLR23explore how to simplify reward design by prompting a large language model (LLM) such as GPT-3 as a proxy reward function, where the user provides a textual prompt containing a few examples (few-shot) or a description (zero-shot) of the desired behavior
Solving Continuous Control via Q-learningDecQNICLR23combine value decomposition with bang-bang action space discretization to DQN to handle continuous control tasks; evaluate on DMControl, Meta World, and Isaac Gym
Wasserstein Auto-encoded MDPs: Formal Verification of Efficiently Distilled RL Policies with Many-sided GuaranteesWAE-MDPICLR23minimize a penalized form of the optimal transport between the behaviors of the agent executing the original policy and the distilled policy
Human-level Atari 200x fasterMEMEICLR23outperform the human baseline across all 57 Atari games in 390M frames; four key components: (1) an approximate trust region method which enables stable bootstrapping from the online network, (2) a normalisation scheme for the loss and priorities which improves robustness when learning a set of value functions with a wide range of scales, (3) an improved architecture employing techniques from NFNets in order to leverage deeper networks without the need for normalization layers, and (4) a policy distillation method which serves to smooth out the instantaneous greedy policy over time.
Improving Deep Policy Gradients with Value Function SearchVFSICLR23focus on improving value approximation and analyzing the effects on Deep PG primitives such as value prediction, variance reduction, and correlation of gradient estimates with the true gradient; show that value functions with better predictions improve Deep PG primitives, leading to better sample efficiency and policies with higher returns
Memory Gym: Partially Observable Challenges to Memory-Based AgentsMemory GymICLR23a benchmark for challenging Deep Reinforcement Learning agents to memorize events across long sequences, be robust to noise, and generalize; consists of the partially observable 2D and discrete control environments Mortar Mayhem, Mystery Path, and Searing Spotlights; [code]
Hybrid RL: Using both offline and online data can make RL efficientHy-QICLR23focus on a hybrid setting named Hybrid RL, where the agent has both an offline dataset and the ability to interact with the environment; extend fitted Q-iteration algorithm
POPGym: Benchmarking Partially Observable Reinforcement LearningPOPGymICLR23a two-part library containing (1) a diverse collection of 15 partially observable environments, each with multiple difficulties and (2) implementations of 13 memory model baselines; [code]
Critic Sequential Monte CarloCriticSMCICLR23combine sequential Monte Carlo with learned Soft-Q function heuristic factors
Planning-oriented Autonomous DrivingCVPR23 best paper
On the Reuse Bias in Off-Policy Reinforcement LearningBIRISIJCAI23discuss the bias of off-policy evaluation due to reusing the replay buffer; derive a high-probability bound of the Reuse Bias; introduce the concept of stability for off-policy algorithms and provide an upper bound for stable off-policy algorithms
The Dormant Neuron Phenomenon in Deep Reinforcement LearningReDoICML23 oralunderstand the underlying reasons behind the loss of expressivity during the training of RL agents; demonstrate the existence of the dormant neuron phenomenon in deep RL; propose Recycling Dormant neurons (ReDo) to reduce the number of dormant neurons and maintain network expressivity during training
Efficient RL via Disentangled Environment and Agent RepresentationsSEARICML23 oralbuild a representation that disentangles a robotic agent from its environment to improve the learning efficiency of RL; augment the RL loss with an agent-centric auxiliary loss
On the Statistical Benefits of Temporal Difference Learning----ICML23 oralprovide crisp insight into the statistical benefits of TD
Settling the Reward Hypothesis----ICML23 oralprovide a treatment of the reward hypothesis in both the setting that goals are the subjective desires of the agent and in the setting where goals are the objective desires of an agent designer
Learning Belief Representations for Partially Observable Deep RLBelieverICML23decouple belief state modelling (via unsupervised learning) from policy optimization (via RL); propose a representation learning approach to capture a compact set of reward-relevant features of the state
Internally Rewarded Reinforcement LearningIRRLICML23study a class of reinforcement learning problems where the reward signals for policy learning are generated by an internal reward model that is dependent on and jointly optimized with the policy; theoretically derive and empirically analyze the effect of the reward function in IRRL and based on these analyses propose the clipped linear reward function
Hyperparameters in Reinforcement Learning and How To Tune Them----ICML23Exploration of the hyperparameter landscape for commonly-used RL algorithms and environments; Comparison of different types of HPO methods on state-of-the-art RL algorithms and challenging RL environments
Langevin Thompson Sampling with Logarithmic Communication: Bandits and Reinforcement LearningICML23
Correcting discount-factor mismatch in on-policy policy gradient methods----ICML23introduce a novel distribution correction to account for the discounted stationary distribution
Reinforcement Learning Can Be More Efficient with Multiple Rewards----ICML23theoretically analyze multi-reward extensions of action-elimination algorithms and prove more favorable instance-dependent regret bounds compared to their single-reward counterparts, both in multi-armed bandits and in tabular Markov decision processes
Performative Reinforcement Learning----ICML23introduce the framework of performative reinforcement learning where the policy chosen by the learner affects the underlying reward and transition dynamics of the environment
Reinforcement Learning with History Dependent Dynamic ContextsDCMDPsICML23introduce DCMDPs, a novel reinforcement learning framework for history-dependent environments that handles non-Markov environments, where contexts change over time; derive an upper-confidence-bound style algorithm for logistic DCMDPs
On Many-Actions Policy GradientMBMAICML23propose MBMA, an approach leveraging dynamics models for many-actions sampling in the context of stochastic policy gradient (SPG), which yields lower bias and comparable variance to SPG estimated from states in model-simulated rollouts
Scaling Laws for Reward Model Overoptimization----ICML23study overoptimization in the context of large language models fine-tuned as reward models trained to predict which of two options a human will prefer; study how the gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-n sampling
Bigger, Better, Faster: Human-level Atari with human-level efficiencyBBFICML23rely on scaling the neural networks used for value estimation and a number of other design choices like resetting
Synthetic Experience ReplaySynthERNeurIPS23utilize diffusion to augment data in the replay buffer; evaluate in both online RL and offline RL
OMPO: A Unified Framework for RL under Policy and Dynamics ShiftsOMPOICML24 oralconsider the distribution discrepancies induced by policy or dynamics shifts; propose a surrogate policy learning objective by considering the transition occupancy discrepancies and then cast it into a tractable min-max optimization problem through dual reformulation
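
To make the dormant-neuron idea from the ReDo entry above concrete, here is a minimal sketch. It assumes the usual normalized-activation score: a neuron's mean absolute activation over a batch, divided by the average of that quantity across the layer, with a neuron counted as dormant when its score is at most a threshold `tau`. The function names and the default threshold are hypothetical and are not taken from the paper's code.

```python
from typing import List


def dormant_scores(activations: List[List[float]]) -> List[float]:
    """Normalized activation score per neuron.

    activations[b][i] is the output of neuron i on batch input b.
    """
    n_batch = len(activations)
    n_neurons = len(activations[0])
    # Mean absolute activation of each neuron over the batch.
    mean_abs = [
        sum(abs(row[i]) for row in activations) / n_batch
        for i in range(n_neurons)
    ]
    # Average of those means across the layer (the normalizer).
    layer_mean = sum(mean_abs) / n_neurons
    if layer_mean == 0.0:
        # Every neuron is silent; report all scores as zero.
        return [0.0] * n_neurons
    return [m / layer_mean for m in mean_abs]


def dormant_indices(activations: List[List[float]], tau: float = 0.1) -> List[int]:
    """Indices of neurons whose score is at most tau (assumed threshold)."""
    return [i for i, s in enumerate(dormant_scores(activations)) if s <= tau]
```

A recycling step would then reinitialize the weights of the returned neurons; that part is omitted here.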

Model Based (Online) RL

Classic Methods

TitleMethodConferenceDescription
Value-Aware Loss Function for Model-based Reinforcement LearningVAMLAISTATS17propose to train the model with a loss based on the difference between TD errors rather than KL-divergence
Model-Ensemble Trust-Region Policy OptimizationME-TRPOICLR18analyze the behavior of vanilla MBRL methods with DNN; propose ME-TRPO with two ideas: (i) use an ensemble of models, (ii) use likelihood ratio derivatives; significantly reduce the sample complexity compared to model-free methods
Model-Based Value Expansion for Efficient Model-Free Reinforcement LearningMVEICML18use a dynamics model to simulate the short-term horizon and Q-learning to estimate the long-term value beyond the simulation horizon; use the trained model and the policy to estimate k-step value function for updating value function
Iterative value-aware model learningIterVAMLNeurIPS18replace the supremum in VAML with the current estimate of the value function
Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value ExpansionSTEVENeurIPS18an extension to MVE; only utilize roll-outs without introducing significant errors
Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics ModelsPETSNeurIPS18propose PETS that incorporate uncertainty via an ensemble of bootstrapped models
Algorithmic Framework for Model-based Deep Reinforcement Learning with Theoretical GuaranteesSLBOICLR19propose a novel algorithmic framework for designing and analyzing model-based RL algorithms with theoretical guarantees: provide a lower bound of the true return satisfying some properties s.t. optimizing this lower bound can actually optimize the true return
When to Trust Your Model: Model-Based Policy OptimizationMBPONeurIPS19propose MBPO with monotonic model-based improvement; theoretically discuss how to choose k for model rollouts
Model Based Reinforcement Learning for AtariSimPLeICLR20the first model-based method to successfully handle the ALE benchmark, with several design choices: (i) deterministic model; (ii) well-designed loss functions; (iii) scheduled sampling; (iv) stochastic models
Bidirectional Model-based Policy OptimizationBMPOICML20an extension to MBPO; consider both forward dynamics model and backward dynamics model
Context-aware Dynamics Model for Generalization in Model-Based Reinforcement LearningCaDMICML20develop a context-aware dynamics model (CaDM) capable of generalizing across a distribution of environments with varying transition dynamics; introduce a backward dynamics model that predicts a previous state by utilizing a context latent vector
A Game Theoretic Framework for Model Based Reinforcement LearningPAL, MALICML20develop a novel framework that casts MBRL as a game between a policy player and a model player; setup a Stackelberg game between the two players
Planning to Explore via Self-Supervised World ModelsPlan2ExploreICML20propose a self-supervised reinforcement learning agent for addressing two challenges: quick adaptation and expected future novelty
Trust the Model When It Is Confident: Masked Model-based Actor-CriticM2ACNeurIPS20an extension to MBPO; use model rollouts only when the model is confident
The LoCA Regret: A Consistent Metric to Evaluate Model-Based Behavior in Reinforcement LearningLoCANeurIPS20propose LoCA to measure how quickly a method adapts its policy after the environment is changed from the first task to the second
Generative Temporal Difference Learning for Infinite-Horizon PredictionGHM, or gamma-modelNeurIPS20propose gamma-model to make long-horizon predictions without the need to repeatedly apply a single-step model
Models, Pixels, and Rewards: Evaluating Design Trade-offs in Visual Model-Based Reinforcement Learning----arXiv2012study a number of design decisions for the predictive model in visual MBRL algorithms, focusing specifically on methods that use a predictive model for planning
Mastering Atari Games with Limited DataEfficientZeroNeurIPS21first achieve super-human performance on Atari games with limited data; propose EfficientZero with three components: (i) use self-supervised learning to learn a temporally consistent environment model, (ii) learn the value prefix in an end-to-end manner, (iii) use the learned model to correct off-policy value targets
On Effective Scheduling of Model-based Reinforcement LearningAutoMBPONeurIPS21an extension to MBPO; automatically schedule the real data ratio as well as other hyperparameters for MBPO
Model-Advantage and Value-Aware Models for Model-Based Reinforcement Learning: Bridging the Gap in Theory and Practice----arxiv22bridge the gap in theory and practice of value-aware model learning (VAML) for model-based RL
Value Gradient weighted Model-Based Reinforcement LearningVaGraMICLR22 Spotlightconsider the objective mismatch problem in MBRL; propose VaGraM by rescaling the MSE loss function with gradient information from the current value function estimate
Constrained Policy Optimization via Bayesian World ModelsLAMBDAICLR22 Spotlightconsider Bayesian model-based methods for CMDP
On-Policy Model Errors in Reinforcement LearningOPCICLR22combine real-world data and a learned model in order to get the best of both worlds; propose to exploit the real-world data for on-policy predictions and use the learned model only to generalize to different actions; propose to use on-policy transition data on top of a separately learned model to enable accurate long-term predictions for MBRL
Temporal Difference Learning for Model Predictive ControlTD-MPCICML22propose to use the model only to predict reward; use a policy to accelerate the planning
Causal Dynamics Learning for Task-Independent State AbstractionICML22
Mismatched no More: Joint Model-Policy Optimization for Model-Based RLMnMNeurIPS22propose a model-based RL algorithm where the model and policy are jointly optimized with respect to the same objective, which is a lower bound on the expected return under the true environment dynamics, and becomes tight under certain assumptions
Reinforcement Learning with Non-Exponential Discounting----NeurIPS22propose a theory for continuous-time model-based reinforcement learning generalized to arbitrary discount functions; derive a Hamilton–Jacobi–Bellman (HJB) equation characterizing the optimal policy and describe how it can be solved using a collocation method
Simplifying Model-based RL: Learning Representations, Latent-space Models, and Policies with One ObjectiveALMICLR23propose a single objective which jointly optimizes the policy, the latent-space model, and the representations produced by the encoder using the same objective: maximize predicted rewards while minimizing the errors in the predicted representations
SpeedyZero: Mastering Atari with Limited Data and TimeSpeedyZeroICLR23a distributed RL system built upon EfficientZero with Priority Refresh and Clipped LARS; lead to human-level performances on the Atari benchmark within 35 minutes using only 300k samples
Investigating the role of model-based learning in exploration and transferICML23
STEERING : Stein Information Directed Exploration for Model-Based Reinforcement LearningSTEERINGICML23
Predictable MDP Abstraction for Unsupervised Model-Based RLPMAICML23apply model-based RL on top of an abstracted, simplified MDP, by restricting unpredictable actions
The Virtues of Laziness in Model-based RL: A Unified Objective and AlgorithmsICML23
Stop Regressing: Training Value Functions via Classification for Scalable Deep RLHL-GaussICML24 oralshow that training value functions with categorical cross-entropy significantly enhances performance and scalability across various domains, including single-task RL on Atari 2600 games, multi-task RL on Atari with large-scale ResNets, robotic manipulation with Q-transformers, playing Chess without search, and a language-agent Wordle task with high-capacity Transformers, achieving state-of-the-art results on these domains
Trust the Model Where It Trusts Itself: Model-Based Actor-Critic with Uncertainty-Aware Rollout AdaptionMACURAICML24propose an easy-to-tune mechanism for model-based rollout length scheduling
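
The k-step value expansion described in the MVE entry above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `policy`, `model`, and `q_value` are hypothetical callables standing in for a learned policy, a learned one-step dynamics model that returns `(next_state, reward)`, and a learned Q-function.

```python
from typing import Any, Callable, Tuple


def k_step_value_target(
    state: Any,
    policy: Callable[[Any], Any],
    model: Callable[[Any, Any], Tuple[Any, float]],
    q_value: Callable[[Any, Any], float],
    k: int,
    gamma: float = 0.99,
) -> float:
    """Roll the model forward k steps, then bootstrap with the Q-function."""
    total = 0.0
    discount = 1.0
    for _ in range(k):
        action = policy(state)
        # Simulated transition from the learned model (assumed interface).
        state, reward = model(state, action)
        total += discount * reward
        discount *= gamma
    # The long-term value beyond the simulation horizon comes from Q.
    return total + discount * q_value(state, policy(state))
```

With `k = 0` the function reduces to the ordinary bootstrapped estimate `q_value(state, policy(state))`, so the horizon controls how much the target relies on the model versus the critic.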

World Models

TitleMethodConferenceDescription
World Models, [NeurIPS version]World ModelsNeurIPS18use an unsupervised manner to learn a compressed spatial and temporal representation of the environment and use the world model to train a very compact and simple policy for solving the required task
Learning latent dynamics for planning from pixelsPlaNetICML19propose PlaNet to learn the environment dynamics from images; the dynamics model consists of a transition model, an observation model, a reward model, and an encoder; use the cross-entropy method to select actions for planning
Dream to Control: Learning Behaviors by Latent ImaginationDreamerICLR20solve long-horizon tasks from images purely by latent imagination; test in image-based MuJoCo; propose to use an agent to replace the control algorithm in the PlaNet
Bridging Imagination and Reality for Model-Based Deep Reinforcement LearningBIRDNeurIPS20propose to maximize the mutual information between imaginary and real trajectories so that the policy improvement learned from imaginary trajectories can be easily generalized to real trajectories
Planning to Explore via Self-Supervised World ModelsPlan2ExploreICML20propose Plan2Explore for self-supervised exploration and fast adaptation to new tasks
Mastering Atari with Discrete World ModelsDreamerv2ICLR21solve long-horizon tasks from images purely by latent imagination; test in image-based Atari
Temporal Predictive Coding For Model-Based Planning In Latent SpaceTPCICML21propose a temporal predictive coding approach for planning from high-dimensional observations and theoretically analyze its ability to prioritize the encoding of task-relevant information
Learning Task Informed AbstractionsTIAICML21introduce the formalism of Task Informed MDP (TiMDP) that is realized by training two models that learn visual features via cooperative reconstruction, but one model is adversarially dissociated from the reward signal
Dreaming: Model-based Reinforcement Learning by Latent Imagination without ReconstructionDreamingICRA21propose a decoder-free extension of Dreamer since the autoencoding based approach often causes object vanishing
Model-Based Reinforcement Learning via Imagination with Derived MemoryIDMNeurIPS21hope to improve the diversity of imagination for model-based policy optimization with the derived memory; point out that current methods cannot effectively enrich the imagination if the latent state is disturbed by random noises
Maximum Entropy Model-based Reinforcement LearningMaxEnt DreamerNeurIPS21create a connection between exploration methods and model-based reinforcement learning; apply maximum-entropy exploration for Dreamer
Discovering and Achieving Goals via World ModelsLEXANeurIPS21train both an explorer and an achiever policy without supervision via imagined rollouts in world models; after the unsupervised phase, solve tasks specified as goal images zero-shot without any additional learning
TransDreamer: Reinforcement Learning with Transformer World ModelsTransDreamerarxiv2202replace the RNN in RSSM by a transformer
DreamerPro: Reconstruction-Free Model-Based Reinforcement Learning with Prototypical RepresentationsDreamerProICML22consider reconstruction-free MBRL; propose to learn the prototypes from the recurrent states of the world model, thereby distilling temporal structures from past observations and actions into the prototypes.
Towards Evaluating Adaptivity of Model-Based Reinforcement Learning Methods----ICML22introduce an improved version of the LoCA setup and use it to evaluate PlaNet and Dreamerv2
Reinforcement Learning with Action-Free Pre-Training from VideosAPVICML22pre-train an action-free latent video prediction model using videos from different domains, and then fine-tune the pre-trained model on target domains
Denoised MDPs: Learning World Models Better Than the World ItselfDenoised MDPICML22divide information into four categories: controllable/uncontrollable (whether it is affected by the action) and reward-relevant/irrelevant (whether it affects the return); propose to only consider information which is controllable and reward-relevant
DreamingV2: Reinforcement Learning with Discrete World Models without ReconstructionDreamingv2arxiv2203adopt both the discrete representation of DreamerV2 and the reconstruction-free objective of Dreaming
Masked World Models for Visual ControlMWMarxiv2206decouple visual representation learning and dynamics learning for visual model-based RL and use masked autoencoder to train visual representation
DayDreamer: World Models for Physical Robot LearningDayDreamerarxiv2206apply Dreamer to 4 robots to learn online and directly in the real world, without any simulators
Iso-Dream: Isolating Noncontrollable Visual Dynamics in World ModelsIso-DreamNeurIPS22consider noncontrollable dynamics independent of the action signals; encourage the world model to learn controllable and noncontrollable sources of spatiotemporal changes on isolated state transition branches; optimize the behavior of the agent on the decoupled latent imaginations of the world model
Learning General World Models in a Handful of Reward-Free DeploymentsCASCADENeurIPS22introduce the reward-free deployment efficiency setting to facilitate generalization (exploration should be task agnostic) and scalability (exploration policies should collect large quantities of data without costly centralized retraining); propose an information theoretic objective inspired by Bayesian Active Learning by specifically maximizing the diversity of trajectories sampled by the population through a novel cascading objective
Learning Robust Dynamics through Variational Sparse GatingVSG, SVSG, BBSNeurIPS22sparsely update the latent states at each step; develop a new partially observable and stochastic environment, called BringBackShapes (BBS)
Transformers are Sample Efficient World ModelsIRISICLR23 oraluse a discrete autoencoder and an autoregressive Transformer to conduct World Models and significantly improve the data efficiency in Atari (2 hours of real-time experience); [code]
Transformer-based World Models Are Happy With 100k InteractionsTWMICLR23present a new autoregressive world model based on the Transformer-XL; obtain excellent results on the Atari 100k benchmark; [code]
Dynamic Update-to-Data Ratio: Minimizing World Model OverfittingDUTDICLR23propose a new general method that dynamically adjusts the update-to-data (UTD) ratio during training based on under- and overfitting detection on a small subset of the continuously collected experience not used for training; apply this method in DreamerV2
Evaluating Long-Term Memory in 3D MazesMemory MazeICLR23introduce the Memory Maze, a 3D domain of randomized mazes specifically designed for evaluating long-term memory in agents, including an online reinforcement learning benchmark, a diverse offline dataset, and an offline probing evaluation; [code]
Mastering Diverse Domains through World ModelsDreamerV3arxiv2301propose DreamerV3 to handle a wide range of domains, including continuous and discrete actions, visual and low-dimensional inputs, 2D and 3D worlds, different data budgets, reward frequencies, and reward scales
Task Aware Dreamer for Task Generalization in Reinforcement LearningTADarXiv2303propose Task Distribution Relevance to capture the relevance of the task distribution quantitatively; propose TAD to use world models to improve task generalization via encoding reward signals into policies
Reparameterized Policy Learning for Multimodal Trajectory OptimizationRPGICML23 oralpropose a principled framework that models the continuous RL policy as a generative model of optimal trajectories; present RPG, which leverages the multimodal policy parameterization and a learned world model to achieve strong exploration capabilities and high data efficiency
Mastering the Unsupervised Reinforcement Learning Benchmark from PixelsDyna-MPCICML23 oralutilize unsupervised model-based RL for pre-training the agent; finetune downstream tasks via a task-aware finetuning strategy combined with a hybrid planner, Dyna-MPC
Posterior Sampling for Deep Reinforcement LearningPSDRLICML23combine efficient uncertainty quantification over latent state space models with a specially tailored continual planning algorithm based on value-function approximation
Model-based Reinforcement Learning with Scalable Composite Policy Gradient EstimatorsTPXICML23propose Total Propagation X, the first composite gradient estimation algorithm using inverse variance weighting that is demonstrated to be applicable at scale; combine TPX with Dreamer
Go Beyond Imagination: Maximizing Episodic Reachability with World ModelsGoBIICML23combine the traditional lifelong novelty motivation with an episodic intrinsic reward that is designed to maximize the stepwise reachability expansion; apply learned world models to generate predicted future states with random actions
Simplified Temporal Consistency Reinforcement LearningTCRLICML23propose a simple representation learning approach relying only on a latent dynamics model trained by latent temporal consistency, and show that it is sufficient for high-performance RL
Do Embodied Agents Dream of Pixelated Sheep: Embodied Decision Making using Language Guided World ModellingDECKARDICML23hypothesize an Abstract World Model (AWM) over subgoals by few-shot prompting an LLM
Demonstration-free Autonomous Reinforcement Learning via Implicit and Bidirectional CurriculumICML23
Curious Replay for Model-based AdaptationCRICML23aid model-based RL agent adaptation by prioritizing replay of experiences the agent knows the least about
Multi-View Masked World Models for Visual Robotic ManipulationMV-MWMICML23train a multi-view masked autoencoder that reconstructs pixels of randomly masked viewpoints and then learn a world model operating on the representations from the autoencoder
Facing off World Model Backbones: RNNs, Transformers, and S4S4WMNeurIPS23propose the first S4-based world model that can generate high-dimensional image sequences through latent imagination

CodeBase

| Title | Conference | Methods | Github |
| --- | --- | --- | --- |
| MBRL-Lib: A Modular Library for Model-based Reinforcement Learning | arxiv21 | MBPO, PETS, PlaNet | link |

(Model Free) Offline RL

Current Methods

TitleMethodConferenceDescription
Off-Policy Deep Reinforcement Learning without ExplorationBCQICML19show that off-policy methods perform badly because of extrapolation error; propose batch-constrained reinforcement learning: maximizing the return as well as minimizing the mismatch between the state-action visitation of the policy and the state-action pairs contained in the batch
Conservative Q-Learning for Offline Reinforcement LearningCQLNeurIPS20propose CQL with conservative q function, which is a lower bound of its true value, since standard off-policy methods will overestimate the value function
Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems----arxiv20tutorial about methods, applications and open problems of offline rl
Uncertainty-Based Offline Reinforcement Learning with Diversified Q-EnsembleNeurIPS21
A Minimalist Approach to Offline Reinforcement LearningTD3+BCNeurIPS21propose to add a behavior cloning term to regularize the policy, and normalize the states over the dataset
DR3: Value-Based Deep Reinforcement Learning Requires Explicit RegularizationDR3ICLR22 Spotlightconsider the implicit regularization effect of SGD in RL; based on theoretical analyses, propose an explicit regularizer, called DR3, and combine with offline RL methods
Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning PBRLICLR22 Spotlightconsider the distributional shift and extrapolation error in offline RL; propose PBRL with bootstrapping, for uncertainty quantification, and an OOD sampling method as a regularizer
COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction EstimationCOptiDICEICLR22 Spotlightconsider offline constrained reinforcement learning; propose COptiDICE to directly optimize the distribution of state-action pairs with constraints
Offline Reinforcement Learning with Value-based Episodic MemoryEVL, VEMICLR22present a new offline V-learning method to learn the value function through the trade-offs between imitation learning and optimal value learning; use a memory-based planning scheme to enhance advantage estimation and conduct policy learning in a regression manner
Offline reinforcement learning with implicit Q-learningIQLICLR22propose to learn an optimal policy with in-sample learning, without ever querying the values of any unseen actions
Offline RL Policies Should Be Trained to be AdaptiveAPE-VICML22 oralshow that learning from an offline dataset does not fully specify the environment; formally demonstrate the necessity of adaptation in offline RL by using the Bayesian formalism and to provide a practical algorithm for learning optimally adaptive policies; propose an ensemble-based offline RL algorithm that imbues policies with the ability to adapt within an episode
When Data Geometry Meets Deep Function: Generalizing Offline Reinforcement LearningDOGEICLR23train a state-conditioned distance function that can be readily plugged into standard actor-critic methods as a policy constraint
Jump-Start Reinforcement LearningJSRLICML23consider the setting that employs two policies to solve tasks: a guide-policy, and an exploration-policy; bootstrap an RL algorithm by gradually “rolling in” with the guide-policy
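
The two ingredients named in the TD3+BC entry above (a behavior cloning term and dataset-wide state normalization) can be sketched in plain Python. This is a rough illustration under stated assumptions: the actor objective is taken to be a scaled Q term plus a mean squared error toward dataset actions, with the scale set from the average absolute Q-value; `alpha`, `eps`, and the function names are illustrative choices, not the reference implementation.

```python
from typing import List


def normalize_states(states: List[List[float]], eps: float = 1e-3) -> List[List[float]]:
    """Standardize each state dimension using dataset mean and std."""
    n, dim = len(states), len(states[0])
    mean = [sum(s[d] for s in states) / n for d in range(dim)]
    std = [
        (sum((s[d] - mean[d]) ** 2 for s in states) / n) ** 0.5
        for d in range(dim)
    ]
    # eps keeps the division safe for constant dimensions.
    return [[(s[d] - mean[d]) / (std[d] + eps) for d in range(dim)] for s in states]


def actor_loss(
    q_values: List[float],
    policy_actions: List[List[float]],
    dataset_actions: List[List[float]],
    alpha: float = 2.5,
) -> float:
    """Loss = -lam * mean(Q) + behavior-cloning MSE (assumed form)."""
    n = len(q_values)
    mean_abs_q = sum(abs(q) for q in q_values) / n
    # Scale the Q term so it stays comparable to the BC term.
    lam = alpha / max(mean_abs_q, 1e-8)
    bc = sum(
        sum((p - a) ** 2 for p, a in zip(pa, da))
        for pa, da in zip(policy_actions, dataset_actions)
    ) / n
    return -lam * (sum(q_values) / n) + bc
```

In a real training loop the Q-values would come from the critic evaluated at the policy's actions and the loss would be differentiated with an autodiff library; here the numbers are passed in directly to keep the sketch dependency-free.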

Combined with Diffusion Models

| Title | Method | Conference | Description |
| --- | --- | --- | --- |
| Planning with Diffusion for Flexible Behavior Synthesis | Diffuser | ICML22 oral | first propose a denoising diffusion model designed for trajectory data and an associated probabilistic framework for behavior synthesis; demonstrate that Diffuser has a number of useful properties and is particularly effective in offline control settings that require long-horizon reasoning and test-time flexibility |
| Is Conditional Generative Modeling all you need for Decision Making? | | ICLR23 oral | |
| Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning | Diffusion-QL | ICLR23 | perform policy regularization using diffusion (or score-based) models; utilize a conditional diffusion model to represent the policy |
| Offline Reinforcement Learning via High-Fidelity Generative Behavior Modeling | SfBC | ICLR23 | decouple the learned policy into two parts: an expressive generative behavior model and an action evaluation model |
| AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners | AdaptDiffuser | ICML23 oral | propose AdaptDiffuser, an evolutionary planning method with diffusion that can self-evolve to improve the diffusion model and hence yield a better planner, which can also adapt to unseen tasks |
| Energy-Guided Diffusion Sampling for Offline-to-Online Reinforcement Learning | EDIS | ICML24 | utilize a diffusion model to extract prior knowledge from the offline dataset and employ energy functions to distill this knowledge for enhanced data generation in the online phase; formulate three distinct energy functions to guide the diffusion sampling process for distribution alignment |
| DIDI: Diffusion-Guided Diversity for Offline Behavioral Generation | DIDI | ICML24 | propose to learn a diverse set of skills from a mixture of label-free offline data |

Model Based Offline RL

| Title | Method | Conference | Description |
| --- | --- | --- | --- |
| Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization | BREMEN | ICLR20 | propose deployment efficiency, which counts the number of changes in the data-collection policy during learning (offline: 1, online: no limit); propose BREMEN with an ensemble of dynamics models for off-policy and offline RL |
| MOPO: Model-based Offline Policy Optimization | MOPO | NeurIPS20 | observe that existing model-based RL algorithms can improve the performance of offline RL compared with model-free RL algorithms; design MOPO by extending MBPO on uncertainty-penalized MDPs (new_reward = reward - uncertainty) |
| MOReL: Model-Based Offline Reinforcement Learning | MOReL | NeurIPS20 | present MOReL for model-based offline RL, including two steps: (a) learning a pessimistic MDP, (b) learning a near-optimal policy in this P-MDP |
| Model-Based Offline Planning | MBOP | ICLR21 | learn a model for planning |
| Representation Balancing Offline Model-Based Reinforcement Learning | RepB-SDE | ICLR21 | focus on learning the representation for a robust model of the environment under the distribution shift and extend RepBM to deal with the curse of horizon; propose RepB-SDE framework for off-policy evaluation and offline RL |
| Conservative Objective Models for Effective Offline Model-Based Optimization | COMs | ICML21 | consider offline model-based optimization (MBO, optimize an unknown function only with some samples); add a regularizer (resembling adversarial training methods) to the objective for learning conservative objective models |
| COMBO: Conservative Offline Model-Based Policy Optimization | COMBO | NeurIPS21 | try to optimize a lower bound of performance without considering uncertainty quantification; extend CQL with model-based methods |
| Weighted Model Estimation for Offline Model-Based Reinforcement Learning | ---- | NeurIPS21 | address the covariate shift issue by re-weighting the model losses for different datapoints |
| Revisiting Design Choices in Model-Based Offline Reinforcement Learning | ---- | ICLR22 Spotlight | conduct a rigorous investigation into a series of these design choices for Model-based Offline RL |
| Planning with Diffusion for Flexible Behavior Synthesis | Diffuser | ICML22 oral | first design a denoising diffusion model for trajectory data and an associated probabilistic framework for behavior synthesis |
| Learning Temporally Abstract World Models without Online Experimentation | OPOSM | ICML23 | present an approach for simultaneously learning sets of skills and temporally abstract, skill-conditioned world models purely from offline data, enabling agents to perform zero-shot online planning of skill sequences for new tasks |
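
The uncertainty-penalized reward mentioned in the MOPO entry above (new_reward = reward - uncertainty) can be illustrated with a short sketch. The uncertainty estimate used here, the largest per-dimension standard deviation across an ensemble's next-state predictions, is a simplifying assumption for illustration, as are the `penalty_coef` parameter and the function names.

```python
from typing import List


def ensemble_disagreement(predictions: List[List[float]]) -> float:
    """Largest per-dimension std across ensemble predictions.

    predictions[m][d] is member m's prediction for state dimension d.
    This is a stand-in uncertainty estimate, chosen for simplicity.
    """
    n, dim = len(predictions), len(predictions[0])
    worst = 0.0
    for d in range(dim):
        mean = sum(p[d] for p in predictions) / n
        std = (sum((p[d] - mean) ** 2 for p in predictions) / n) ** 0.5
        worst = max(worst, std)
    return worst


def penalized_reward(reward: float, uncertainty: float, penalty_coef: float = 1.0) -> float:
    """new_reward = reward - penalty_coef * uncertainty."""
    return reward - penalty_coef * uncertainty
```

The policy is then trained on model rollouts with the penalized reward, so transitions where the ensemble members disagree become less attractive.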

Meta RL

TitleMethodConferenceDescription
RL2 : Fast reinforcement learning via slow reinforcement learningRL2arxiv16view the learning process of the agent itself as an objective; structure the agent as a recurrent neural network to store past rewards, actions, observations and termination flags for adapting to the task at hand when deployed
Model-Agnostic Meta-Learning for Fast Adaptation of Deep NetworksMAMLICML17propose a general framework for different learning problems, including classification, regression and reinforcement learning; the main idea is to optimize the parameters to quickly adapt to new tasks (with a few steps of gradient descent)
Meta reinforcement learning with latent variable gaussian processes----arxiv18
Learning to adapt in dynamic, real-world environments through meta-reinforcement learningReBAL, GrBALICLR18consider learning online adaptation in the context of model-based reinforcement learning
Meta-Learning by Adjusting Priors Based on Extended PAC-Bayes Theory----ICML18extend various PAC-Bayes bounds to meta learning
Meta reinforcement learning of structured exploration strategiesNeurIPS18
Meta-learning surrogate models for sequential decision makingarxiv19
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context VariablesPEARLICML19encode past tasks’ experience with probabilistic latent context and use inference network to estimate the posterior
Fast context adaptation via meta-learningCAVIAICML19propose CAVIA as an extension to MAML that is less prone to meta-overfitting, easier to parallelise, and more interpretable; partition the model parameters into two parts: context parameters and shared parameters, and only update the former one in the test stage
Taming MAML: Efficient Unbiased Meta-Reinforcement LearningICML19
Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement LearningMeta WorldCoRL19an environment for meta RL as well as multi-task RL
Guided meta-policy searchGMPSNeurIPS19consider the sample efficiency during the meta-training process by using supervised imitation learning
Meta-Q-LearningMQLICLR20an off-policy algorithm for meta RL that builds upon three simple ideas: (i) Q Learning with context variable represented by past trajectories is competitive with SOTA; (ii) Multi-task objective is useful for meta RL; (iii) Past data from the meta-training replay buffer can be recycled
Varibad: A very good method for bayes-adaptive deep RL via meta-learningvariBADICLR20represent a single MDP M using a learned, low-dimensional stochastic latent variable m; jointly meta-train a variational auto-encoder that can infer the posterior distribution over m in a new task, and a policy that conditions on this posterior belief over MDP embeddings
On the global optimality of model-agnostic meta-learning, ICML version----ICML20characterize the optimality gap of the stationary points attained by MAML for both RL and supervised learning
Meta-reinforcement learning robust to distributional shift via model identification and experience relabelingMIERarxiv20
FOCAL: Efficient fully-offline meta-reinforcement learning via distance metric learning and behavior regularizationFOCALICLR21first consider offline meta-reinforcement learning; propose FOCAL based on PEARL
Offline meta reinforcement learning with advantage weightingMACAWICML21introduce the offline meta reinforcement learning problem setting; propose an optimization-based meta-learning algorithm named MACAW that uses simple, supervised regression objectives for both the inner and outer loop of meta-training
Improving Generalization in Meta-RL with Imaginary Tasks from Latent Dynamics MixtureLDMNeurIPS21aim to train an agent that prepares for unseen test tasks during training, propose to train a policy on mixture tasks along with original training tasks for preventing the agent from overfitting the training tasks
Unifying Gradient Estimators for Meta-Reinforcement Learning via Off-Policy Evaluation----NeurIPS21present a unified framework for estimating higher-order derivatives of value functions, based on the concept of off-policy evaluation, for gradient-based meta rl
Generalization of Model-Agnostic Meta-Learning Algorithms: Recurring and Unseen Tasks----NeurIPS21
Offline Meta Learning of Exploration, Offline Meta Reinforcement Learning -- Identifiability Challenges and Effective Data Collection StrategiesBOReLNeurIPS21
On the Convergence Theory of Debiased Model-Agnostic Meta-Reinforcement LearningSG-MRLNeurIPS21
Hindsight Task Relabelling: Experience Replay for Sparse Reward Meta-RL----NeurIPS21
Generalization Bounds for Meta-Learning via PAC-Bayes and Uniform Stability----NeurIPS21provide a generalization bound on meta-learning by combining the PAC-Bayes technique and uniform stability
Bootstrapped Meta-LearningBMGICLR22 Oralpropose BMG to let the meta-learner teach itself, tackling ill-conditioning and myopic meta-objectives in meta-learning; BMG introduces a meta-bootstrap to mitigate myopia and formulates the meta-objective in terms of minimising distance to control curvature
Model-Based Offline Meta-Reinforcement Learning with RegularizationMerPO, RACICLR22empirically point out that offline Meta-RL could be outperformed by offline single-task RL methods on tasks with good quality of datasets; consider how to learn an informative offline meta-policy in order to achieve the optimal tradeoff between “exploring” the out-of-distribution state-actions by following the meta-policy and “exploiting” the offline dataset by staying close to the behavior policy; propose MerPO which learns a meta-model for efficient task structure inference and an informative meta-policy for safe exploration of out-of-distribution state-actions
Skill-based Meta-Reinforcement LearningSiMPLICLR22propose a method that jointly leverages (i) a large offline dataset of prior experience collected across many tasks without reward or task annotations and (ii) a set of meta-training tasks to learn how to quickly solve unseen long-horizon tasks.
Hindsight Foresight Relabeling for Meta-Reinforcement LearningHFRICLR22focus on improving the sample efficiency of the meta-training phase via data sharing; combine relabeling techniques with meta-RL algorithms in order to boost both sample efficiency and asymptotic performance
CoMPS: Continual Meta Policy SearchCoMPSICLR22first formulate the continual meta-RL setting, where the agent interacts with a single task at a time and, once finished with a task, never interacts with it again
Learning a subspace of policies for online adaptation in Reinforcement Learning----ICLR22consider the setting with just a single train environment; propose an approach where we learn a subspace of policies within the parameter space
An Adaptive Deep RL Method for Non-Stationary Environments with Piecewise Stable ContextSeCBADNeurIPS22introduce latent situational MDP with piecewise-stable context; jointly infer the belief distribution over latent context with the posterior over segment length and perform more accurate belief context inference with observed data within the current context segment
Model-based Meta Reinforcement Learning using Graph Structured Surrogate Models and Amortized Policy SearchGSSMICML22consider model-based meta reinforcement learning, which consists of dynamics model learning and policy optimization; develop a graph structured dynamics model with superior generalization capability across tasks
Meta-Learning Hypothesis Spaces for Sequential Decision-makingMeta-KeLICML22propose Meta-KeL for meta-learning the hypothesis space of a sequential decision task
Transformers are Meta-Reinforcement LearnersTrMRLICML22argue that two critical capabilities of transformers, reasoning over long-term dependencies and presenting context-dependent weights from self-attention, compose the central role of a Meta-Reinforcement Learner; propose TrMRL, a memory-based meta-Reinforcement Learner which uses the transformer architecture to formulate the learning process
ContraBAR: Contrastive Bayes-Adaptive Deep RLContraBARICML23investigate whether contrastive methods, like contrastive predictive coding, can be used for learning Bayes-optimal behavior
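
The MAML row above describes optimizing parameters so that a few gradient steps adapt them to a new task. A minimal sketch of that inner/outer loop follows, assuming a toy scalar problem in which each task has the quadratic loss 0.5 * (theta - target)^2. The task set, the step sizes and the function names are all hypothetical and are not taken from the paper's experiments.

```python
def inner_adapt(theta, target, inner_lr):
    """One gradient step on the toy task loss 0.5 * (theta - target) ** 2."""
    return theta - inner_lr * (theta - target)


def maml_meta_gradient(theta, target, inner_lr):
    """Gradient of the post-adaptation loss with respect to the initial theta."""
    adapted = inner_adapt(theta, target, inner_lr)
    # Chain rule: d(adapted)/d(theta) = 1 - inner_lr for this quadratic loss.
    return (adapted - target) * (1.0 - inner_lr)


def meta_train(task_targets, theta=0.0, inner_lr=0.1, outer_lr=0.5, steps=200):
    """Outer loop: average the meta-gradient over tasks and update the initialization."""
    for _ in range(steps):
        grad = sum(maml_meta_gradient(theta, c, inner_lr) for c in task_targets)
        theta -= outer_lr * grad / len(task_targets)
    return theta


# Hypothetical task family: three scalar targets.
theta_star = meta_train([-1.0, 0.0, 4.0])
adapted_to_new_task = inner_adapt(theta_star, 4.0, inner_lr=0.1)
```

For this toy loss the meta-learned initialization settles at the mean of the task targets; in deep RL the same structure is applied to policy parameters with sampled trajectories in place of the closed-form gradient.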

Adversarial RL

TitleMethodConferenceDescription
Adversarial Attacks on Neural Network Policies----ICLR 2017 workshopfirst show that existing rl policies coupled with deep neural networks are vulnerable to adversarial noises in white-box and black-box settings
Delving into Adversarial Attacks on Deep Policies----ICLR 2017 workshopshow rl algorithms are vulnerable to adversarial noises; show adversarial training can improve robustness
Robust Adversarial Reinforcement LearningRARLICML17formulate the robust policy learning as a zero-sum, minimax objective function
Stealthy and Efficient Adversarial Attacks against Deep Reinforcement LearningCritical Point Attack, Antagonist AttackAAAI20critical point attack: build a model to predict the future environmental states and agent’s actions for attacking; antagonist attack: automatically learn a domain-agnostic model for attacking
Safe Reinforcement Learning in Constrained Markov Decision ProcessesSNO-MDPICML20explore and optimize Markov decision processes under unknown safety constraints
Robust Deep Reinforcement Learning Against Adversarial Perturbations on State ObservationsSA-MDPNeurIPS20formalize adversarial attack on state observation as SA-MDP; propose some novel attack methods: Robust SARSA and Maximal Action Difference; propose a defence framework and some practical methods: SA-DQN, SA-PPO and SA-DDPG
Robust Reinforcement Learning on State Observations with Learned Optimal AdversaryATLAICLR21use rl algorithms to train an "optimal" adversary; alternately train the "optimal" adversary and the robust agent
Robust Deep Reinforcement Learning through Adversarial LossRADIAL-RLNeurIPS21propose a robust rl framework, which penalizes the overlap between output bounds of actions; propose a more efficient evaluation method (GWC) to measure attack agnostic robustness
Policy Smoothing for Provably Robust Reinforcement LearningPolicy SmoothingICLR22introduce randomized smoothing into RL; propose an adaptive Neyman-Pearson Lemma
CROP: Certifying Robust Policies for Reinforcement Learning through Functional SmoothingCROPICLR22present a framework of Certifying Robust Policies for RL (CROP) against adversarial state perturbations with two certification criteria: robustness of per-state actions and lower bound of cumulative rewards; theoretically prove the certification radius; conduct experiments to provide certification for six empirically robust RL algorithms on Atari
Understanding Adversarial Attacks on Observations in Deep Reinforcement Learning----SCIS 2023summarize current optimization-based adversarial attacks in RL; propose a two-stage method: train a deceptive policy and mislead the victim to imitate the deceptive policy
Consistent Attack: Universal Adversarial Perturbation on Embodied Vision NavigationReward UAP, Trajectory UAPPRL 2023extend universal adversarial perturbations to sequential decision making and propose Reward UAP as well as Trajectory UAP by exploiting the dynamics; experiment in Embodied Vision Navigation tasks

Generalisation in RL

Environments

TitleMethodConferenceDescription
Quantifying Generalization in Reinforcement LearningCoinRunICML19introduce a new environment called CoinRun for generalisation in RL; empirically show L2 regularization, dropout, data augmentation and batch normalization can improve generalization in RL
Leveraging Procedural Generation to Benchmark Reinforcement LearningProcgen BenchmarkICML20introduce Procgen Benchmark, a suite of 16 procedurally generated game-like environments designed to benchmark both sample efficiency and generalization in reinforcement learning

Methods

TitleMethodConferenceDescription
Towards Generalization and Simplicity in Continuous Control----NeurIPS17policies with simple linear and RBF parameterizations can be trained to solve a variety of widely studied continuous control tasks; training with a diverse initial state distribution induces more global policies with better generalization
Universal Planning NetworksUPNICML18study a model-based architecture that performs a differentiable planning computation in a latent space jointly learned with forward dynamics, trained end-to-end to encode what is necessary for solving tasks by gradient-based planning
On the Generalization Gap in Reparameterizable Reinforcement Learning----ICML19theoretically provide guarantees on the gap between the expected and empirical return for both intrinsic and external errors in reparameterizable RL
Investigating Generalisation in Continuous Deep Reinforcement Learning----arxiv19study generalisation in Deep RL for continuous control
Generalization in Reinforcement Learning with Selective Noise Injection and Information BottleneckSNINeurIPS19consider regularization techniques relying on the injection of noise into the learned function for improving generalization; hope to maintain the regularizing effect of the injected noise and mitigate its adverse effects on the gradient quality
Network randomization: A simple technique for generalization in deep reinforcement learningNetwork RandomizationICLR20introduce a randomized (convolutional) neural network that randomly perturbs input observations, which enables trained agents to adapt to new domains by learning robust features invariant across varied and randomized environments
Observational Overfitting in Reinforcement Learningobservational overfittingICLR20discuss realistic instances where observational overfitting may occur and its difference from other confounding factors, and design a parametric theoretical framework to induce observational overfitting that can be applied to any underlying MDP
Context-aware Dynamics Model for Generalization in Model-Based Reinforcement LearningCaDMICML20decompose the task of learning a global dynamics model into two stages: (a) learning a context latent vector that captures the local dynamics, then (b) predicting the next state conditioned on it
Improving Generalization in Reinforcement Learning with Mixture RegularizationmixregNeurIPS20train agents on a mixture of observations from different training environments and impose linearity constraints on the observation interpolations and the supervision (e.g. associated reward) interpolations
Instance based Generalization in Reinforcement LearningIPAENeurIPS20formalize the concept of training levels as instances and show that this instance-based view is fully consistent with the standard POMDP formulation; provide generalization bounds to the value gap in train and test environments based on the number of training instances, and use insights based on these to improve performance on unseen levels
Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement LearningPSMICLR21incorporate the inherent sequential structure in reinforcement learning into the representation learning process to improve generalization; introduce a theoretically motivated policy similarity metric (PSM) for measuring behavioral similarity between states
Generalization in Reinforcement Learning by Soft Data AugmentationSODAICRA21impose a soft constraint on the encoder that aims to maximize the mutual information between latent representations of augmented and non-augmented data
Augmented World Models Facilitate Zero-Shot Dynamics Generalization From a Single Offline EnvironmentAugWMICML21consider the setting named "dynamics generalization from a single offline environment" and concentrate on the zero-shot performance to unseen dynamics; propose dynamics augmentation for model based offline RL; propose a simple self-supervised context adaptation reward-free algorithm
Decoupling Value and Policy for Generalization in Reinforcement LearningIDAACICML21decouples the optimization of the policy and value function, using separate networks to model them; introduce an auxiliary loss which encourages the representation to be invariant to task-irrelevant properties of the environment
Why Generalization in RL is Difficult: Epistemic POMDPs and Implicit Partial ObservabilityLEEPNeurIPS21generalisation in RL induces implicit partial observability; propose LEEP to use an ensemble of policies to approximately learn the Bayes-optimal policy for maximizing test-time performance
Automatic Data Augmentation for Generalization in Reinforcement LearningDrACNeurIPS21focus on automatic data augmentation based on two novel regularization terms for the policy and value function
When Is Generalizable Reinforcement Learning Tractable?----NeurIPS21propose Weak Proximity and Strong Proximity for theoretically analyzing the generalisation of RL
A Survey of Generalisation in Deep Reinforcement Learning----arxiv21provide a unifying formalism and terminology for discussing different generalisation problems
Cross-Trajectory Representation Learning for Zero-Shot Generalization in RLCTRLICLR22consider zero-shot generalization (ZSG); use self-supervised learning to learn a representation across tasks
The Role of Pretrained Representations for the OOD Generalization of RL Agents----ICLR22train 240 representations and 11,520 downstream policies and systematically investigate their performance under a diverse range of distribution shifts; find that a specific representation metric that measures the generalization of a simple downstream proxy task reliably predicts the generalization of downstream RL agents under the broad spectrum of OOD settings considered here
Generalisation in Lifelong Reinforcement Learning through Logical Composition----ICLR22leverage logical composition in reinforcement learning to create a framework that enables an agent to autonomously determine whether a new task can be immediately solved using its existing abilities, or whether a task-specific skill should be learned
Local Feature Swapping for Generalization in Reinforcement LearningCLOPICLR22introduce a new regularization technique consisting of channel-consistent local permutations of the feature maps
A Generalist AgentGatoarxiv2205slide
Towards Safe Reinforcement Learning via Constraining Conditional Value at RiskCPPOIJCAI22find the connection between modifying observations and dynamics, which are structurally different
CtrlFormer: Learning Transferable State Representation for Visual Control via TransformerCtrlFormerICML22jointly learns self-attention mechanisms between visual tokens and policy tokens among different control tasks, where multitask representation can be learned and transferred without catastrophic forgetting
Learning Dynamics and Generalization in Reinforcement Learning----ICML22show theoretically that temporal difference learning encourages agents to fit non-smooth components of the value function early in training, and at the same time induces the second-order effect of discouraging generalization
Improving Policy Optimization with Generalist-Specialist LearningGSLICML22hope to utilize experiences from the specialists to aid the policy optimization of the generalist; propose the phenomenon “catastrophic ignorance” in multi-task learning
DRIBO: Robust Deep Reinforcement Learning via Multi-View Information BottleneckDRIBOICML22learn robust representations that encode only task-relevant information from observations based on the unsupervised multi-view setting; introduce a novel contrastive version of the Multi-View Information Bottleneck (MIB) objective for temporal data
Generalizing Goal-Conditioned Reinforcement Learning with Variational Causal ReasoningGRADERNeurIPS22use the causal graph as a latent variable to reformulate the GCRL problem and then derive an iterative training framework from solving this problem
Rethinking Value Function Learning for Generalization in Reinforcement LearningDCPG, DDCPGNeurIPS22consider training agents on multiple training environments to improve observational generalization performance; identify that the value network in the multiple-environment setting is more challenging to optimize; propose regularization methods that penalize large estimates of the value network for preventing overfitting
Masked Autoencoding for Scalable and Generalizable Decision MakingMaskDPNeurIPS22apply a masked autoencoder (MAE) to state-action trajectories for reinforcement learning (RL) and behavioral cloning (BC) and gain the capability of zero-shot transfer to new tasks
Pre-Trained Image Encoder for Generalizable Visual Reinforcement LearningPIE-GNeurIPS22find that the early layers in an ImageNet pre-trained ResNet model could provide rather generalizable representations for visual RL
Look where you look! Saliency-guided Q-networks for visual RL tasksSGQNNeurIPS22propose that a good visual policy should be able to identify which pixels are important for its decision; preserve this identification of important sources of information across images
Human-Timescale Adaptation in an Open-Ended Task SpaceAdAarXiv 2301demonstrate that training an RL agent at scale leads to a general in-context learning algorithm that can adapt to open-ended novel embodied 3D problems as quickly as humans
In-context Reinforcement Learning with Algorithm DistillationADICLR23 oralpropose Algorithm Distillation for distilling reinforcement learning (RL) algorithms into neural networks by modeling their training histories with a causal sequence model
Performance Bounds for Model and Policy Transfer in Hidden-parameter MDPs----ICLR23show that, given a fixed amount of pretraining data, agents trained with more variations are able to generalize better; find that increasing the capacity of the value and policy network is critical to achieve good performance
Investigating Multi-task Pretraining and Generalization in Reinforcement Learning----ICLR23find that, given a fixed amount of pretraining data, agents trained with more variations are able to generalize better; this advantage can persist after fine-tuning for 200M environment frames, not only in zero-shot transfer
Cross-domain Random Pre-training with Prototypes for Reinforcement LearningCRPTproarXiv2302use prototypical representation learning with a novel intrinsic loss to pre-train an effective and generic encoder across different domains
Task Aware Dreamer for Task Generalization in Reinforcement LearningTADarXiv2303propose Task Distribution Relevance to capture the relevance of the task distribution quantitatively; propose TAD to use world models to improve task generalization via encoding reward signals into policies
The Benefits of Model-Based Generalization in Reinforcement Learning----ICML23provide theoretical and empirical insight into when, and how, we can expect data generated by a learned model to be useful
Multi-Environment Pretraining Enables Transfer to Action Limited DatasetsALPTICML23given n source environments with fully action labelled dataset, consider offline RL in the target environment with a small action labelled dataset and a large dataset without action labels; utilize inverse dynamics model to learn a representation that generalizes well to the limited action data from the target environment
In-Context Reinforcement Learning for Variable Action SpacesHeadless-ADICML24extend Algorithm Distillation to environments with variable discrete action spaces
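
The mixreg row above describes training on convex mixtures of observations while interpolating the associated supervision with the same weight. A minimal sketch of that mixing step follows, assuming observations are flat lists of floats and that the mixing weight is drawn from a Beta(alpha, alpha) distribution as in mixup. The function names, the `alpha` default and the batch layout are hypothetical.

```python
import random


def mix_transitions(obs_a, reward_a, obs_b, reward_b, lam):
    """Convex combination of two observations and their rewards with one shared weight."""
    mixed_obs = [lam * a + (1.0 - lam) * b for a, b in zip(obs_a, obs_b)]
    mixed_reward = lam * reward_a + (1.0 - lam) * reward_b
    return mixed_obs, mixed_reward


def sample_mixed_batch(batch, alpha=0.2, rng=None):
    """Pair each (obs, reward) with a shuffled partner and mix them.

    The Beta(alpha, alpha) sampling is an assumption borrowed from mixup.
    """
    rng = rng or random.Random(0)
    partners = batch[:]
    rng.shuffle(partners)
    mixed = []
    for (obs_a, r_a), (obs_b, r_b) in zip(batch, partners):
        lam = rng.betavariate(alpha, alpha)
        mixed.append(mix_transitions(obs_a, r_a, obs_b, r_b, lam))
    return mixed


# Hypothetical two-feature observations from two training environments.
example_obs, example_reward = mix_transitions([0.0, 2.0], 1.0, [4.0, 6.0], 3.0, 0.25)
```

Using one weight for both the observation and its reward is what enforces the linearity constraint mentioned in the row; image observations would be mixed pixel-wise in the same way.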

RL with Transformer

TitleMethodConferenceDescription
Stabilizing transformers for reinforcement learningGTrXLICML20stabilizing training with a reordering of the layer normalization coupled with the addition of a new gating mechanism to key points in the submodules of the transformer
Decision Transformer: Reinforcement Learning via Sequence ModelingDTNeurIPS21regard RL as a sequence generation task and use transformer to generate (return-to-go, state, action, return-to-go, ...); there is no explicit optimization process; evaluate on Offline RL
Offline Reinforcement Learning as One Big Sequence Modeling ProblemTTNeurIPS21regard RL as a sequence generation task and use transformer to generate (s_0^0, ..., s_0^N, a_0^0, ..., a_0^M, r_0, ...); use beam search for inference; evaluate on imitation learning, goal-conditioned RL and Offline RL
Can Wikipedia Help Offline Reinforcement Learning?ChibiTarxiv2201demonstrate that pre-training on autoregressively modeling natural language provides consistent performance gains when compared to the Decision Transformer on both the popular OpenAI Gym and Atari
Online Decision TransformerODTICML22 oralblends offline pretraining with online finetuning in a unified framework; use sequence-level entropy regularizers in conjunction with autoregressive modeling objectives for sample-efficient exploration and finetuning
Prompting Decision Transformer for Few-shot Policy GeneralizationICML22
Multi-Game Decision Transformers----NeurIPS22show that a single transformer-based model trained purely offline can play a suite of up to 46 Atari games simultaneously at close-to-human performance
Grounding Large Language Models in Interactive Environments with Online Reinforcement LearningGLAMICML23consider an agent using an LLM as a policy that is progressively updated as the agent interacts with the environment, leveraging online Reinforcement Learning to improve its performance to solve goals
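
The Decision Transformer row above models trajectories as the sequence (return-to-go, state, action, return-to-go, ...). A minimal sketch of how such a sequence could be built from a logged trajectory follows, assuming undiscounted returns. The function names and the tagged-tuple token format are hypothetical; a real model would embed each element rather than store tagged tuples.

```python
def returns_to_go(rewards):
    """Undiscounted sum of the rewards from each timestep to the end of the episode."""
    rtg = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    rtg.reverse()
    return rtg


def interleave_tokens(rewards, states, actions):
    """Build the (return-to-go, state, action) sequence for one trajectory."""
    tokens = []
    for g, s, a in zip(returns_to_go(rewards), states, actions):
        tokens.extend([("return_to_go", g), ("state", s), ("action", a)])
    return tokens


# Hypothetical three-step trajectory with string placeholders for states and actions.
tokens = interleave_tokens([1.0, 0.0, 2.0], ["s0", "s1", "s2"], ["a0", "a1", "a2"])
```

At evaluation time the first return-to-go is set to a desired target return and decremented by each observed reward, which is how the sequence model is steered without an explicit optimization step.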

Tutorial and Lesson

Tutorial and Lesson
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto
Introduction to Reinforcement Learning with David Silver
Deep Reinforcement Learning, CS285
Deep Reinforcement Learning and Control, CMU 10703
RLChina

ICLR22

PaperType
Bootstrapped Meta-Learningoral
The Information Geometry of Unsupervised Reinforcement Learningoral
SO(2)-Equivariant Reinforcement Learningspotlight
CoBERL: Contrastive BERT for Reinforcement Learningspotlight
Understanding and Preventing Capacity Loss in Reinforcement Learningspotlight
On Lottery Tickets and Minimal Task Representations in Deep Reinforcement Learningspotlight
Reinforcement Learning with Sparse Rewards using Guidance from Offline Demonstrationspotlight
Sample Efficient Deep Reinforcement Learning via Uncertainty Estimationspotlight
Generative Planning for Temporally Coordinated Exploration in Reinforcement Learningspotlight
When should agents explore?spotlight
Revisiting Design Choices in Model-Based Offline Reinforcement Learningspotlight
DR3: Value-Based Deep Reinforcement Learning Requires Explicit Regularizationspotlight
Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning spotlight
COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimationspotlight
Value Gradient weighted Model-Based Reinforcement Learningspotlight
Constrained Policy Optimization via Bayesian World Modelsspotlight
Cross-Trajectory Representation Learning for Zero-Shot Generalization in RLposter
The Role of Pretrained Representations for the OOD Generalization of RL Agentsposter
Generalisation in Lifelong Reinforcement Learning through Logical Compositionposter
Local Feature Swapping for Generalization in Reinforcement Learningposter
Policy Smoothing for Provably Robust Reinforcement Learningposter
CROP: Certifying Robust Policies for Reinforcement Learning through Functional Smoothingposter
Model-Based Offline Meta-Reinforcement Learning with Regularizationposter
Skill-based Meta-Reinforcement Learningposter
Hindsight Foresight Relabeling for Meta-Reinforcement Learningposter
CoMPS: Continual Meta Policy Searchposter
Learning a subspace of policies for online adaptation in Reinforcement Learningposter
Pessimistic Model-based Offline Reinforcement Learning under Partial Coverageposter
Pareto Policy Pool for Model-based Offline Reinforcement Learningposter
Offline Reinforcement Learning with Value-based Episodic Memoryposter
Offline reinforcement learning with implicit Q-learningposter
On-Policy Model Errors in Reinforcement Learningposter
Maximum Entropy RL (Provably) Solves Some Robust RL Problems poster
Maximizing Ensemble Diversity in Deep Reinforcement Learningposter
Learning Generalizable Representations for Reinforcement Learning via Adaptive Meta-learner of Behavioral Similaritiesposter
Lipschitz Constrained Unsupervised Skill Discoveryposter
Learning more skills through optimistic explorationposter

ICML22

PaperType
Online Decision Transformeroral
The Unsurprising Effectiveness of Pre-Trained Vision Models for Controloral
The Importance of Non-Markovianity in Maximum State Entropy Explorationoral
Planning with Diffusion for Flexible Behavior Synthesisoral
Adversarially Trained Actor Critic for Offline Reinforcement Learningoral
Learning Bellman Complete Representations for Offline Policy Evaluationoral
Offline RL Policies Should Be Trained to be Adaptiveoral
Large Batch Experience Replayoral
Do Differentiable Simulators Give Better Gradients for Policy Optimization?oral
Federated Reinforcement Learning: Communication-Efficient Algorithms and Convergence Analysisoral
An Analytical Update Rule for General Policy Optimizationoral
Generalised Policy Improvement with Geometric Policy Compositionoral
Prompting Decision Transformer for Few-shot Policy Generalizationposter
CtrlFormer: Learning Transferable State Representation for Visual Control via Transformerposter
Learning Dynamics and Generalization in Reinforcement Learningposter
Improving Policy Optimization with Generalist-Specialist Learningposter
DRIBO: Robust Deep Reinforcement Learning via Multi-View Information Bottleneckposter
Policy Gradient Method For Robust Reinforcement Learningposter
SAUTE RL: Toward Almost Surely Safe Reinforcement Learning Using State Augmentationposter
Constrained Variational Policy Optimization for Safe Reinforcement Learningposter
Robust Deep Reinforcement Learning through Bootstrapped Opportunistic Curriculumposter
Distributionally Robust Q-Learningposter
Robust Meta-learning with Sampling Noise and Label Noise via Eigen-Reptileposter
Model-based Meta Reinforcement Learning using Graph Structured Surrogate Models and Amortized Policy Searchposter
Meta-Learning Hypothesis Spaces for Sequential Decision-makingposter
Biased Gradient Estimate with Drastic Variance Reduction for Meta Reinforcement Learningposter
Transformers are Meta-Reinforcement Learnersposter
Offline Meta-Reinforcement Learning with Online Self-Supervisionposter
Regularizing a Model-based Policy Stationary Distribution to Stabilize Offline Reinforcement Learningposter
Pessimistic Q-Learning for Offline Reinforcement Learning: Towards Optimal Sample Complexityposter
How to Leverage Unlabeled Data in Offline Reinforcement Learning?poster
On the Role of Discount Factor in Offline Reinforcement Learningposter
Model Selection in Batch Policy Optimizationposter
Koopman Q-learning: Offline Reinforcement Learning via Symmetries of Dynamicsposter
Robust Task Representations for Offline Meta-Reinforcement Learning via Contrastive Learningposter
Pessimism meets VCG: Learning Dynamic Mechanism Design via Offline Reinforcement Learningposter
Showing Your Offline Reinforcement Learning Work: Online Evaluation Budget Mattersposter
Constrained Offline Policy Optimizationposter
DreamerPro: Reconstruction-Free Model-Based Reinforcement Learning with Prototypical Representationsposter
| Towards Evaluating Adaptivity of Model-Based Reinforcement Learning Methods | poster |
| Reinforcement Learning with Action-Free Pre-Training from Videos | poster |
| Denoised MDPs: Learning World Models Better Than the World Itself | poster |
| Temporal Difference Learning for Model Predictive Control | poster |
| Causal Dynamics Learning for Task-Independent State Abstraction | poster |
| Why Should I Trust You, Bellman? The Bellman Error is a Poor Replacement for Value Error | poster |
| Adaptive Model Design for Markov Decision Process | poster |
| Stabilizing Off-Policy Deep Reinforcement Learning from Pixels | poster |
| Understanding Policy Gradient Algorithms: A Sensitivity-Based Approach | poster |
| Mirror Learning: A Unifying Framework of Policy Optimisation | poster |
| Continuous Control with Action Quantization from Demonstrations | poster |
| Off-Policy Fitted Q-Evaluation with Differentiable Function Approximators: Z-Estimation and Inference Theory | poster |
| A Temporal-Difference Approach to Policy Gradient Estimation | poster |
| The Primacy Bias in Deep Reinforcement Learning | poster |
| Optimizing Sequential Experimental Design with Deep Reinforcement Learning | poster |
| The Geometry of Robust Value Functions | poster |
| Direct Behavior Specification via Constrained Reinforcement Learning | poster |
| Utility Theory for Markovian Sequential Decision Making | poster |
| Reducing Variance in Temporal-Difference Value Estimation via Ensemble of Deep Networks | poster |
| Unifying Approximate Gradient Updates for Policy Optimization | poster |
| EqR: Equivariant Representations for Data-Efficient Reinforcement Learning | poster |
| Provable Reinforcement Learning with a Short-Term Memory | poster |
| Optimal Estimation of Off-Policy Policy Gradient via Double Fitted Iteration | poster |
| Cliff Diving: Exploring Reward Surfaces in Reinforcement Learning Environments | poster |
| Lagrangian Method for Q-Function Learning (with Applications to Machine Translation) | poster |
| Learning to Assemble with Large-Scale Structured Reinforcement Learning | poster |
| Addressing Optimism Bias in Sequence Modeling for Reinforcement Learning | poster |
| Off-Policy Reinforcement Learning with Delayed Rewards | poster |
| Reachability Constrained Reinforcement Learning | poster |
| Flow-based Recurrent Belief State Learning for POMDPs | poster |
| Off-Policy Evaluation for Large Action Spaces via Embeddings | poster |
| Doubly Robust Distributionally Robust Off-Policy Evaluation and Learning | poster |
| On Well-posedness and Minimax Optimal Rates of Nonparametric Q-function Estimation in Off-policy Evaluation | poster |
| Communicating via Maximum Entropy Reinforcement Learning | poster |

NeurIPS22

| Paper | Type |
| ---- | ---- |
| Multi-Game Decision Transformers | poster |
| Bootstrapped Transformer for Offline Reinforcement Learning | poster |
| Generalizing Goal-Conditioned Reinforcement Learning with Variational Causal Reasoning | poster |
| Rethinking Value Function Learning for Generalization in Reinforcement Learning | poster |
| Masked Autoencoding for Scalable and Generalizable Decision Making | poster |
| Pre-Trained Image Encoder for Generalizable Visual Reinforcement Learning | poster |
| GALOIS: Boosting Deep Reinforcement Learning via Generalizable Logic Synthesis | poster |
| Look where you look! Saliency-guided Q-networks for visual RL tasks | poster |
| An Adaptive Deep RL Method for Non-Stationary Environments with Piecewise Stable Context | poster |
| Model-Based Offline Reinforcement Learning with Pessimism-Modulated Dynamics Belief | poster |
| A Unified Framework for Alternating Offline Model Training and Policy Learning | poster |
| Bidirectional Learning for Offline Infinite-width Model-based Optimization | poster |
| DASCO: Dual-Generator Adversarial Support Constrained Offline Reinforcement Learning | poster |
| Supported Policy Optimization for Offline Reinforcement Learning | poster |
| Why So Pessimistic? Estimating Uncertainties for Offline RL through Ensembles, and Why Their Independence Matters | poster |
| Oracle Inequalities for Model Selection in Offline Reinforcement Learning | poster |
| Mildly Conservative Q-Learning for Offline Reinforcement Learning | poster |
| A Policy-Guided Imitation Approach for Offline Reinforcement Learning | poster |
| LobsDICE: Offline Learning from Observation via Stationary Distribution Correction Estimation | poster |
| Latent-Variable Advantage-Weighted Policy Optimization for Offline RL | poster |
| How Far I'll Go: Offline Goal-Conditioned Reinforcement Learning via f-Advantage Regression | poster |
| NeoRL: A Near Real-World Benchmark for Offline Reinforcement Learning | poster |
| When does return-conditioned supervised learning work for offline reinforcement learning? | poster |
| Bellman Residual Orthogonalization for Offline Reinforcement Learning | poster |
| Mismatched no More: Joint Model-Policy Optimization for Model-Based RL | poster |
| When to Update Your Model: Constrained Model-based Reinforcement Learning | poster |
| Bayesian Optimistic Optimization: Optimistic Exploration for Model-Based Reinforcement Learning | poster |
| Model-based Lifelong Reinforcement Learning with Bayesian Exploration | poster |
| Plan to Predict: Learning an Uncertainty-Foreseeing Model for Model-Based Reinforcement Learning | poster |
| Data-Driven Model-Based Optimization via Invariant Representation Learning | poster |
| Reinforcement Learning with Non-Exponential Discounting | poster |
| Reinforcement Learning with Neural Radiance Fields | poster |
| Recursive Reinforcement Learning | poster |
| Challenging Common Assumptions in Convex Reinforcement Learning | poster |
| Explicable Policy Search | poster |
| On Reinforcement Learning and Distribution Matching for Fine-Tuning Language Models with no Catastrophic Forgetting | poster |
| When to Ask for Help: Proactive Interventions in Autonomous Reinforcement Learning | poster |
| Adaptive Bio-Inspired Fish Simulation with Deep Reinforcement Learning | poster |
| Reinforcement Learning in a Birth and Death Process: Breaking the Dependence on the State Space | poster |
| Discovered Policy Optimisation | poster |
| Faster Deep Reinforcement Learning with Slower Online Network | poster |
| Exploration-Guided Reward Shaping for Reinforcement Learning under Sparse Rewards | poster |
| Large-Scale Retrieval for Reinforcement Learning | poster |
| Sustainable Online Reinforcement Learning for Auto-bidding | poster |
| LECO: Learnable Episodic Count for Task-Specific Intrinsic Reward | poster |
| DNA: Proximal Policy Optimization with a Dual Network Architecture | poster |
| Online Reinforcement Learning for Mixed Policy Scopes | poster |
| ProtoX: Explaining a Reinforcement Learning Agent via Prototyping | poster |
| Hardness in Markov Decision Processes: Theory and Practice | poster |
| Robust Phi-Divergence MDPs | poster |
| On the convergence of policy gradient methods to Nash equilibria in general stochastic games | poster |
| A Unified Off-Policy Evaluation Approach for General Value Function | poster |
| Robust On-Policy Sampling for Data-Efficient Policy Evaluation in Reinforcement Learning | poster |
| Continuous Deep Q-Learning in Optimal Control Problems: Normalized Advantage Functions Analysis | poster |
| Parametrically Retargetable Decision-Makers Tend To Seek Power | poster |
| Batch size-invariance for policy optimization | poster |
| Trust Region Policy Optimization with Optimal Transport Discrepancies: Duality and Algorithm for Continuous Actions | poster |
| Adaptive Interest for Emphatic Reinforcement Learning | poster |
| The Nature of Temporal Difference Errors in Multi-step Distributional Reinforcement Learning | poster |
| Reincarnating Reinforcement Learning: Reusing Prior Computation to Accelerate Progress | poster |
| Bayesian Risk Markov Decision Processes | poster |
| Explainable Reinforcement Learning via Model Transforms | poster |
| PDSketch: Integrated Planning Domain Programming and Learning | poster |
| Contrastive Learning as Goal-Conditioned Reinforcement Learning | poster |
| Does Self-supervised Learning Really Improve Reinforcement Learning from Pixels? | poster |
| Reinforcement Learning with Automated Auxiliary Loss Search | poster |
| Mask-based Latent Reconstruction for Reinforcement Learning | poster |
| Iso-Dream: Isolating Noncontrollable Visual Dynamics in World Models | poster |
| Learning General World Models in a Handful of Reward-Free Deployments | poster |
| Learning Robust Dynamics through Variational Sparse Gating | poster |
| A Mixture of Surprises for Unsupervised Reinforcement Learning | poster |
| Unsupervised Reinforcement Learning with Contrastive Intrinsic Control | poster |
| Unsupervised Skill Discovery via Recurrent Skill Training | poster |
| The Pitfalls of Regularizations in Off-Policy TD Learning | poster |
| Off-Policy Evaluation for Action-Dependent Non-Stationary Environments | poster |
| Local Metric Learning for Off-Policy Evaluation in Contextual Bandits with Continuous Actions | poster |
| Off-Policy Evaluation with Policy-Dependent Optimization Response | poster |

ICLR23

| Paper | Type |
| ---- | ---- |
| Dichotomy of Control: Separating What You Can Control from What You Cannot | oral |
| In-context Reinforcement Learning with Algorithm Distillation | oral |
| Is Conditional Generative Modeling all you need for Decision Making? | oral |
| Offline Q-learning on Diverse Multi-Task Data Both Scales And Generalizes | oral |
| Confidence-Conditioned Value Functions for Offline Reinforcement Learning | oral |
| Extreme Q-Learning: MaxEnt RL without Entropy | oral |
| Sparse Q-Learning: Offline Reinforcement Learning with Implicit Value Regularization | oral |
| Transformers are Sample Efficient World Models | oral |
| Sample-Efficient Reinforcement Learning by Breaking the Replay Ratio Barrier | oral |
| Guarded Policy Optimization with Imperfect Online Demonstrations | spotlight |
| Towards Interpretable Deep Reinforcement Learning with Human-Friendly Prototypes | spotlight |
| Pink Noise Is All You Need: Colored Noise Exploration in Deep Reinforcement Learning | spotlight |
| DEP-RL: Embodied Exploration for Reinforcement Learning in Overactuated and Musculoskeletal Systems | spotlight |
| The In-Sample Softmax for Offline Reinforcement Learning | spotlight |
| Benchmarking Offline Reinforcement Learning on Real-Robot Hardware | spotlight |
| Choreographer: Learning and Adapting Skills in Imagination | spotlight |
| Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training | spotlight |
| Decision Transformer under Random Frame Dropping | poster |
| Hyper-Decision Transformer for Efficient Online Policy Adaptation | poster |
| Preference Transformer: Modeling Human Preferences using Transformers for RL | poster |
| On the Data-Efficiency with Contrastive Image Transformation in Reinforcement Learning | poster |
| Can Agents Run Relay Race with Strangers? Generalization of RL to Out-of-Distribution Trajectories | poster |
| Performance Bounds for Model and Policy Transfer in Hidden-parameter MDPs | poster |
| Investigating Multi-task Pretraining and Generalization in Reinforcement Learning | poster |
| Priors, Hierarchy, and Information Asymmetry for Skill Transfer in Reinforcement Learning | poster |
| On the Robustness of Safe Reinforcement Learning under Observational Perturbations | poster |
| Distributional Meta-Gradient Reinforcement Learning | poster |
| Conservative Bayesian Model-Based Value Expansion for Offline Policy Optimization | poster |
| Value Memory Graph: A Graph-Structured World Model for Offline Reinforcement Learning | poster |
| Efficient Offline Policy Optimization with a Learned Model | poster |
| Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning | poster |
| Offline Reinforcement Learning via High-Fidelity Generative Behavior Modeling | poster |
| Decision S4: Efficient Sequence-Based RL via State Spaces Layers | poster |
| Behavior Proximal Policy Optimization | poster |
| Learning Achievement Structure for Structured Exploration in Domains with Sparse Reward | poster |
| Explaining RL Decisions with Trajectories | poster |
| User-Interactive Offline Reinforcement Learning | poster |
| Pareto-Efficient Decision Agents for Offline Multi-Objective Reinforcement Learning | poster |
| Offline RL for Natural Language Generation with Implicit Language Q Learning | poster |
| In-sample Actor Critic for Offline Reinforcement Learning | poster |
| Harnessing Mixed Offline Reinforcement Learning Datasets via Trajectory Weighting | poster |
| Mind the Gap: Offline Policy Optimization for Imperfect Rewards | poster |
| When Data Geometry Meets Deep Function: Generalizing Offline Reinforcement Learning | poster |
| MAHALO: Unifying Offline Reinforcement Learning and Imitation Learning from Observations | poster |
| Transformer-based World Models Are Happy With 100k Interactions | poster |
| Dynamic Update-to-Data Ratio: Minimizing World Model Overfitting | poster |
| Evaluating Long-Term Memory in 3D Mazes | poster |
| Making Better Decision by Directly Planning in Continuous Control | poster |
| HiT-MDP: Learning the SMDP option framework on MDPs with Hidden Temporal Embeddings | poster |
| Diminishing Return of Value Expansion Methods in Model-Based Reinforcement Learning | poster |
| Simplifying Model-based RL: Learning Representations, Latent-space Models, and Policies with One Objective | poster |
| SpeedyZero: Mastering Atari with Limited Data and Time | poster |
| Efficient Deep Reinforcement Learning Requires Regulating Statistical Overfitting | poster |
| Replay Memory as An Empirical MDP: Combining Conservative Estimation with Experience Replay | poster |
| Greedy Actor-Critic: A New Conditional Cross-Entropy Method for Policy Improvement | poster |
| Reward Design with Language Models | poster |
| Solving Continuous Control via Q-learning | poster |
| Wasserstein Auto-encoded MDPs: Formal Verification of Efficiently Distilled RL Policies with Many-sided Guarantees | poster |
| Quality-Similar Diversity via Population Based Reinforcement Learning | poster |
| Human-level Atari 200x faster | poster |
| Policy Expansion for Bridging Offline-to-Online Reinforcement Learning | poster |
| Improving Deep Policy Gradients with Value Function Search | poster |
| Memory Gym: Partially Observable Challenges to Memory-Based Agents | poster |
| Hybrid RL: Using both offline and online data can make RL efficient | poster |
| POPGym: Benchmarking Partially Observable Reinforcement Learning | poster |
| Critic Sequential Monte Carlo | poster |
| Revocable Deep Reinforcement Learning with Affinity Regularization for Outlier-Robust Graph Matching | poster |
| Provable Unsupervised Data Sharing for Offline Reinforcement Learning | poster |
| Discovering Policies with DOMiNO: Diversity Optimization Maintaining Near Optimality | poster |
| Latent Variable Representation for Reinforcement Learning | poster |
| Spectral Decomposition Representation for Reinforcement Learning | poster |
| Behavior Prior Representation learning for Offline Reinforcement Learning | poster |
| Become a Proficient Player with Limited Data through Watching Pure Videos | poster |
| Variational Latent Branching Model for Off-Policy Evaluation | poster |

ICML23

| Paper | Type |
| ---- | ---- |
| On the Power of Pre-training for Generalization in RL: Provable Benefits and Hardness | oral |
| AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners | oral |
| Reparameterized Policy Learning for Multimodal Trajectory Optimization | oral |
| Mastering the Unsupervised Reinforcement Learning Benchmark from Pixels | oral |
| The Dormant Neuron Phenomenon in Deep Reinforcement Learning | oral |
| Efficient RL via Disentangled Environment and Agent Representations | oral |
| On the Statistical Benefits of Temporal Difference Learning | oral |
| Warm-Start Actor-Critic: From Approximation Error to Sub-optimality Gap | oral |
| Reinforcement Learning from Passive Data via Latent Intentions | oral |
| Subequivariant Graph Reinforcement Learning in 3D Environments | oral |
| Representation Learning with Multi-Step Inverse Kinematics: An Efficient and Optimal Approach to Rich-Observation RL | oral |
| Flipping Coins to Estimate Pseudocounts for Exploration in Reinforcement Learning | oral |
| Settling the Reward Hypothesis | oral |
| Information-Theoretic State Space Model for Multi-View Reinforcement Learning | oral |
| Learning Belief Representations for Partially Observable Deep RL | poster |
| Internally Rewarded Reinforcement Learning | poster |
| Active Policy Improvement from Multiple Black-box Oracles | poster |
| When is Realizability Sufficient for Off-Policy Reinforcement Learning? | poster |
| The Statistical Benefits of Quantile Temporal-Difference Learning for Value Estimation | poster |
| Hyperparameters in Reinforcement Learning and How To Tune Them | poster |
| Langevin Thompson Sampling with Logarithmic Communication: Bandits and Reinforcement Learning | poster |
| Correcting discount-factor mismatch in on-policy policy gradient methods | poster |
| Masked Trajectory Models for Prediction, Representation, and Control | poster |
| Off-Policy Average Reward Actor-Critic with Deterministic Policy Search | poster |
| TGRL: An Algorithm for Teacher Guided Reinforcement Learning | poster |
| LIV: Language-Image Representations and Rewards for Robotic Control | poster |
| Stein Variational Goal Generation for adaptive Exploration in Multi-Goal Reinforcement Learning | poster |
| Emergence of Adaptive Circadian Rhythms in Deep Reinforcement Learning | poster |
| Explaining Reinforcement Learning with Shapley Values | poster |
| Reinforcement Learning Can Be More Efficient with Multiple Rewards | poster |
| Performative Reinforcement Learning | poster |
| Truncating Trajectories in Monte Carlo Reinforcement Learning | poster |
| ReLOAD: Reinforcement Learning with Optimistic Ascent-Descent for Last-Iterate Convergence in Constrained MDPs | poster |
| Low-Switching Policy Gradient with Exploration via Online Sensitivity Sampling | poster |
| Hyperbolic Diffusion Embedding and Distance for Hierarchical Representation Learning | poster |
| Revisiting Domain Randomization via Relaxed State-Adversarial Policy Optimization | poster |
| Parallel $Q$-Learning: Scaling Off-policy Reinforcement Learning under Massively Parallel Simulation | poster |
| LESSON: Learning to Integrate Exploration Strategies for Reinforcement Learning via an Option Framework | poster |
| Graph Reinforcement Learning for Network Control via Bi-Level Optimization | poster |
| Stochastic Policy Gradient Methods: Improved Sample Complexity for Fisher-non-degenerate Policies | poster |
| Reinforcement Learning with History Dependent Dynamic Contexts | poster |
| Efficient Online Reinforcement Learning with Offline Data | poster |
| Variance Control for Distributional Reinforcement Learning | poster |
| Hindsight Learning for MDPs with Exogenous Inputs | poster |
| RLang: A Declarative Language for Describing Partial World Knowledge to Reinforcement Learning Agents | poster |
| Scalable Safe Policy Improvement via Monte Carlo Tree Search | poster |
| Bayesian Reparameterization of Reward-Conditioned Reinforcement Learning with Energy-based Models | poster |
| Understanding the Complexity Gains of Single-Task RL with a Curriculum | poster |
| PPG Reloaded: An Empirical Study on What Matters in Phasic Policy Gradient | poster |
| On Many-Actions Policy Gradient | poster |
| Multi-task Hierarchical Adversarial Inverse Reinforcement Learning | poster |
| Cell-Free Latent Go-Explore | poster |
| Trustworthy Policy Learning under the Counterfactual No-Harm Criterion | poster |
| Reachability-Aware Laplacian Representation in Reinforcement Learning | poster |
| Interactive Object Placement with Reinforcement Learning | poster |
| Leveraging Offline Data in Online Reinforcement Learning | poster |
| Reinforcement Learning with General Utilities: Simpler Variance Reduction and Large State-Action Space | poster |
| DoMo-AC: Doubly Multi-step Off-policy Actor-Critic Algorithm | poster |
| Scaling Laws for Reward Model Overoptimization | poster |
| SNeRL: Semantic-aware Neural Radiance Fields for Reinforcement Learning | poster |
| Set-membership Belief State-based Reinforcement Learning for POMDPs | poster |
| Robust Satisficing MDPs | poster |
| Off-Policy Evaluation for Large Action Spaces via Conjunct Effect Modeling | poster |
| Quantum Policy Gradient Algorithm with Optimized Action Decoding | poster |
| For Pre-Trained Vision Models in Motor Control, Not All Policy Learning Methods are Created Equal | poster |
| Model-Free Robust Average-Reward Reinforcement Learning | poster |
| Fair and Robust Estimation of Heterogeneous Treatment Effects for Policy Learning | poster |
| Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning | poster |
| Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons | poster |
| Social learning spontaneously emerges by searching optimal heuristics with deep reinforcement learning | poster |
| Bigger, Better, Faster: Human-level Atari with human-level efficiency | poster |
| Posterior Sampling for Deep Reinforcement Learning | poster |
| Model-based Reinforcement Learning with Scalable Composite Policy Gradient Estimators | poster |
| Go Beyond Imagination: Maximizing Episodic Reachability with World Models | poster |
| Simplified Temporal Consistency Reinforcement Learning | poster |
| Do Embodied Agents Dream of Pixelated Sheep: Embodied Decision Making using Language Guided World Modelling | poster |
| Demonstration-free Autonomous Reinforcement Learning via Implicit and Bidirectional Curriculum | poster |
| Curious Replay for Model-based Adaptation | poster |
| Multi-View Masked World Models for Visual Robotic Manipulation | poster |
| Automatic Intrinsic Reward Shaping for Exploration in Deep Reinforcement Learning | poster |
| Curiosity in Hindsight: Intrinsic Exploration in Stochastic Environments | poster |
| Representations and Exploration for Deep Reinforcement Learning using Singular Value Decomposition | poster |
| Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning | poster |
| Distilling Internet-Scale Vision-Language Models into Embodied Agents | poster |
| VIMA: Robot Manipulation with Multimodal Prompts | poster |
| Future-conditioned Unsupervised Pretraining for Decision Transformer | poster |
| Emergent Agentic Transformer from Chain of Hindsight Experience | poster |
| The Benefits of Model-Based Generalization in Reinforcement Learning | poster |
| Multi-Environment Pretraining Enables Transfer to Action Limited Datasets | poster |
| On Pre-Training for Visuo-Motor Control: Revisiting a Learning-from-Scratch Baseline | poster |
| Unsupervised Skill Discovery for Learning Shared Structures across Changing Environments | poster |
| An Investigation into Pre-Training Object-Centric Representations for Reinforcement Learning | poster |
| Guiding Pretraining in Reinforcement Learning with Large Language Models | poster |
| What is Essential for Unseen Goal Generalization of Offline Goal-conditioned RL? | poster |
| Online Prototype Alignment for Few-shot Policy Transfer | poster |
| Detecting Adversarial Directions in Deep Reinforcement Learning to Make Robust Decisions | poster |
| Robust Situational Reinforcement Learning in Face of Context Disturbances | poster |
| Adversarial Learning of Distributional Reinforcement Learning | poster |
| Towards Robust and Safe Reinforcement Learning with Benign Off-policy Data | poster |
| Simple Embodied Language Learning as a Byproduct of Meta-Reinforcement Learning | poster |
| ContraBAR: Contrastive Bayes-Adaptive Deep RL | poster |
| Model-based Offline Reinforcement Learning with Count-based Conservatism | poster |
| Model-Bellman Inconsistency for Model-based Offline Reinforcement Learning | poster |
| Learning Temporally Abstract World Models without Online Experimentation | poster |
| Contrastive Energy Prediction for Exact Energy-Guided Diffusion Sampling in Offline Reinforcement Learning | poster |
| MetaDiffuser: Diffusion Model as Conditional Planner for Offline Meta-RL | poster |
| Actor-Critic Alignment for Offline-to-Online Reinforcement Learning | poster |
| Semi-Supervised Offline Reinforcement Learning with Action-Free Trajectories | poster |
| Principled Offline RL in the Presence of Rich Exogenous Information | poster |
| Offline Meta Reinforcement Learning with In-Distribution Online Adaptation | poster |
| Policy Regularization with Dataset Constraint for Offline Reinforcement Learning | poster |
| Supported Trust Region Optimization for Offline Reinforcement Learning | poster |
| Constrained Decision Transformer for Offline Safe Reinforcement Learning | poster |
| PAC-Bayesian Offline Contextual Bandits With Guarantees | poster |
| Beyond Reward: Offline Preference-guided Policy Optimization | poster |
| Offline Reinforcement Learning with Closed-Form Policy Improvement Operators | poster |
| ChiPFormer: Transferable Chip Placement via Offline Decision Transformer | poster |
| Boosting Offline Reinforcement Learning with Action Preference Query | poster |
| Jump-Start Reinforcement Learning | poster |
| Investigating the role of model-based learning in exploration and transfer | poster |
| STEERING: Stein Information Directed Exploration for Model-Based Reinforcement Learning | poster |
| Predictable MDP Abstraction for Unsupervised Model-Based RL | poster |
| The Virtues of Laziness in Model-based RL: A Unified Objective and Algorithms | poster |
| On the Importance of Feature Decorrelation for Unsupervised Representation Learning in Reinforcement Learning | poster |
| CLUTR: Curriculum Learning via Unsupervised Task Representation Learning | poster |
| Controllability-Aware Unsupervised Skill Discovery | poster |
| Behavior Contrastive Learning for Unsupervised Skill Discovery | poster |
| Variational Curriculum Reinforcement Learning for Unsupervised Discovery of Skills | poster |
| Bootstrapped Representations in Reinforcement Learning | poster |
| Representation-Driven Reinforcement Learning | poster |
| Improved Policy Evaluation for Randomized Trials of Algorithmic Resource Allocation | poster |
| An Instrumental Variable Approach to Confounded Off-Policy Evaluation | poster |
| Semiparametrically Efficient Off-Policy Evaluation in Linear Markov Decision Processes | poster |

NeurIPS23

| Paper | Type |
| ---- | ---- |
| Learning Generalizable Agents via Saliency-guided Features Decorrelation | oral |
| Understanding Expertise through Demonstrations: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning | oral |
| When Demonstrations meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning | oral |
| DiffuseBot: Breeding Soft Robots With Physics-Augmented Generative Diffusion Models | oral |
| When Do Transformers Shine in RL? Decoupling Memory from Credit Assignment | oral |
| Bridging RL Theory and Practice with the Effective Horizon | oral |
| SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks | spotlight |
| RePo: Resilient Model-Based Reinforcement Learning by Regularizing Posterior Predictability | spotlight |
| Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration | spotlight |
| Conditional Mutual Information for Disentangled Representations in Reinforcement Learning | spotlight |
| Optimistic Natural Policy Gradient: a Simple Efficient Policy Optimization Framework for Online RL | spotlight |
| Double Gumbel Q-Learning | spotlight |
| Future-Dependent Value-Based Off-Policy Evaluation in POMDPs | spotlight |
| Supervised Pretraining Can Learn In-Context Reinforcement Learning | spotlight |
| Train Once, Get a Family: State-Adaptive Balances for Offline-to-Online Reinforcement Learning | spotlight |
| Constraint-Conditioned Policy Optimization for Versatile Safe Reinforcement Learning | poster |
| Explore to Generalize in Zero-Shot RL | poster |
| Dynamics Generalisation in Reinforcement Learning via Adaptive Context-Aware Policies | poster |
| Reining Generalization in Offline Reinforcement Learning via Representation Distinction | poster |
| Contrastive Retrospection: honing in on critical steps for rapid learning and generalization in RL | poster |
| Doubly Robust Augmented Transfer for Meta-Reinforcement Learning | poster |
| Recurrent Hypernetworks are Surprisingly Strong in Meta-RL | poster |
| Parameterizing Non-Parametric Meta-Reinforcement Learning Tasks via Subtask Decomposition | poster |
| One Risk to Rule Them All: A Risk-Sensitive Perspective on Model-Based Offline Reinforcement Learning | poster |
| Efficient Diffusion Policies For Offline Reinforcement Learning | poster |
| Learning to Influence Human Behavior with Offline Reinforcement Learning | poster |
| Design from Policies: Conservative Test-Time Adaptation for Offline Policy Optimization | poster |
| SafeDICE: Offline Safe Imitation Learning with Non-Preferred Demonstrations | poster |
| Constrained Policy Optimization with Explicit Behavior Density For Offline Reinforcement Learning | poster |
| Conservative State Value Estimation for Offline Reinforcement Learning | poster |
| Offline RL with Discrete Proxy Representations for Generalizability in POMDPs | poster |
| Context Shift Reduction for Offline Meta-Reinforcement Learning | poster |
| Mutual Information Regularized Offline Reinforcement Learning | poster |
| Recovering from Out-of-sample States via Inverse Dynamics in Offline Reinforcement Learning | poster |
| Percentile Criterion Optimization in Offline Reinforcement Learning | poster |
| Language Models Meet World Models: Embodied Experiences Enhance Language Models | poster |
| Action Inference by Maximising Evidence: Zero-Shot Imitation from Observation with World Models | poster |
| Facing off World Model Backbones: RNNs, Transformers, and S4 | poster |
| Efficient Exploration in Continuous-time Model-based Reinforcement Learning | poster |
| Model-Based Reparameterization Policy Gradient Methods: Theory and Practical Algorithms | poster |
| Learning to Discover Skills through Guidance | poster |
| Creating Multi-Level Skill Hierarchies in Reinforcement Learning | poster |
| Unsupervised Behavior Extraction via Random Intent Priors | poster |
| MIMEx: Intrinsic Rewards from Masked Input Modeling | poster |
| f-Policy Gradients: A General Framework for Goal-Conditioned RL using f-Divergences | poster |
| Prediction and Control in Continual Reinforcement Learning | poster |
| Residual Q-Learning: Offline and Online Policy Customization without Value | poster |
| Small batch deep reinforcement learning | poster |
| Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs | poster |
| Is RLHF More Difficult than Standard RL? A Theoretical Perspective | poster |
| Reflexion: language agents with verbal reinforcement learning | poster |
| Generative Modelling of Stochastic Actions with Arbitrary Constraints in Reinforcement Learning | poster |
| Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning | poster |
| Direct Preference-based Policy Optimization without Reward Modeling | poster |
| Learning to Modulate pre-trained Models in RL | poster |
| Ignorance is Bliss: Robust Control via Information Gating | poster |
| Marginal Density Ratio for Off-Policy Evaluation in Contextual Bandits | poster |
| Model-Free Reinforcement Learning with the Decision-Estimation Coefficient | poster |
| Optimal and Fair Encouragement Policy Evaluation and Learning | poster |
| BIRD: Generalizable Backdoor Detection and Removal for Deep Reinforcement Learning | poster |
| Probabilistic Inference in Reinforcement Learning Done Right | poster |
| Reference-Based POMDPs | poster |
| Persuading Farsighted Receivers in MDPs: the Power of Honesty | poster |
| Distributional Policy Evaluation: a Maximum Entropy approach to Representation Learning | poster |
| Structured State Space Models for In-Context Reinforcement Learning | poster |
| An Alternative to Variance: Gini Deviation for Risk-averse Policy Gradient | poster |
| Distributional Model Equivalence for Risk-Sensitive Reinforcement Learning | poster |
| PLASTIC: Improving Input and Label Plasticity for Sample Efficient Reinforcement Learning | poster |
| Hybrid Policy Optimization from Imperfect Demonstrations | poster |
| Policy Optimization in a Noisy Neighborhood: On Return Landscapes in Continuous Control | poster |
| Semantic HELM: A Human-Readable Memory for Reinforcement Learning | poster |
| A Definition of Continual Reinforcement Learning | poster |
| Fast Bellman Updates for Wasserstein Distributionally Robust MDPs | poster |
| Policy Gradient for Rectangular Robust Markov Decision Processes | poster |
| Discovering Hierarchical Achievements in Reinforcement Learning via Contrastive Learning | poster |
| Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach | poster |
| Model-Free Active Exploration in Reinforcement Learning | poster |
| Self-Supervised Reinforcement Learning that Transfers using Random Features | poster |
| FlowPG: Action-constrained Policy Gradient with Normalizing Flows | poster |
| Flexible Attention-Based Multi-Policy Fusion for Efficient Deep Reinforcement Learning | poster |
| ODE-based Recurrent Model-free Reinforcement Learning for POMDPs | poster |
| Suggesting Variable Order for Cylindrical Algebraic Decomposition via Reinforcement Learning | poster |
| SPQR: Controlling Q-ensemble Independence with Spiked Random Model for Reinforcement Learning | poster |
| CaMP: Causal Multi-policy Planning for Interactive Navigation in Multi-room Scenes | poster |
| POMDP Planning for Object Search in Partially Unknown Environment | poster |
| Unified Off-Policy Learning to Rank: a Reinforcement Learning Perspective | poster |
| Natural Actor-Critic for Robust Reinforcement Learning with Function Approximation | poster |
| A Long $N$-step Surrogate Stage Reward for Deep Reinforcement Learning | poster |
| State-Action Similarity-Based Representations for Off-Policy Evaluation | poster |
| Weakly Coupled Deep Q-Networks | poster |
| Large Language Models Are Semi-Parametric Reinforcement Learning Agents | poster |
| The Benefits of Being Distributional: Small-Loss Bounds for Reinforcement Learning | poster |
| Online Nonstochastic Model-Free Reinforcement Learning | poster |
| When is Agnostic Reinforcement Learning Statistically Tractable? | poster |
| Bayesian Risk-Averse Q-Learning with Streaming Observations | poster |
| Resetting the Optimizer in Deep RL: An Empirical Study | poster |
| Optimistic Exploration in Reinforcement Learning Using Symbolic Model Estimates | poster |
| Performance Bounds for Policy-Based Average Reward Reinforcement Learning Algorithms | poster |
| Regularity as Intrinsic Reward for Free Play | poster |
| TACO: Temporal Latent Action-Driven Contrastive Loss for Visual Reinforcement Learning | poster |
| Policy Optimization for Continuous Reinforcement Learning | poster |
| Active Observing in Continuous-time Control | poster |
| Replicable Reinforcement Learning | poster |
| On the Importance of Exploration for Generalization in Reinforcement Learning | poster |
| Monte Carlo Tree Search with Boltzmann Exploration | poster |
| Iterative Reachability Estimation for Safe Reinforcement Learning | poster |
| Discovering General Reinforcement Learning Algorithms with Adversarial Environment Design | poster |
| Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence? | poster |
| Inverse Dynamics Pretraining Learns Good Representations for Multitask Imitation | poster |
| Interpretable Reward Redistribution in Reinforcement Learning: A Causal Approach | poster |
| Contrastive Modules with Temporal Attention for Multi-Task Reinforcement Learning | poster |
| Sample-Efficient and Safe Deep Reinforcement Learning via Reset Deep Ensemble Agents | poster |
| Distributional Pareto-Optimal Multi-Objective Reinforcement Learning | poster |
| Efficient Policy Adaptation with Contrastive Prompt Ensemble for Embodied Agents | poster |
| Efficient Potential-based Exploration in Reinforcement Learning using Inverse Dynamic Bisimulation Metric | poster |
| Iteratively Learn Diverse Strategies with State Distance Information | poster |
| Accelerating Reinforcement Learning with Value-Conditional State Entropy Exploration | poster |
| Gradient Informed Proximal Policy Optimization | poster |
| The Curious Price of Distributional Robustness in Reinforcement Learning with a Generative Model | poster |
| Optimal Treatment Allocation for Efficient Policy Evaluation in Sequential Decision Making | poster |
| Thinker: Learning to Plan and Act | poster |
| Learning Better with Less: Effective Augmentation for Sample-Efficient Visual Reinforcement Learning | poster |
| Reinforcement Learning with Simple Sequence Priors | poster |
| Can Pre-Trained Text-to-Image Models Generate Visual Goals for Reinforcement Learning? | poster |
| Beyond Uniform Sampling: Offline Reinforcement Learning with Imbalanced Datasets | poster |
| CQM: Curriculum Reinforcement Learning with a Quantized World Model | poster |
| H-InDex: Visual Reinforcement Learning with Hand-Informed Representations for Dexterous Manipulation | poster |
| Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning | poster |
| Anytime-Competitive Reinforcement Learning with Policy Prior | poster |
| Budgeting Counterfactual for Offline RL | poster |
| Fractal Landscapes in Policy Optimization | poster |
| Goal-Conditioned Predictive Coding for Offline Reinforcement Learning | poster |
| For SALE: State-Action Representation Learning for Deep Reinforcement Learning | poster |
| Inverse Reinforcement Learning with the Average Reward Criterion | poster |
| Revisiting the Minimalist Approach to Offline Reinforcement Learning | poster |
| Adversarial Model for Offline Reinforcement Learning | poster |
| Supported Value Regularization for Offline Reinforcement Learning | poster |
| PID-Inspired Inductive Biases for Deep Reinforcement Learning in Partially Observable Control Tasks | poster |
| How to Fine-tune the Model: Unified Model Shift and Model Bias Policy Optimization | poster |
| Learning from Visual Observation via Offline Pretrained State-to-Go Transformer | poster |
Describe, Explain, Plan and Select: Interactive Planning with LLMs Enables Open-World Multi-Task Agentsposter
Robust Knowledge Transfer in Tiered Reinforcement Learningposter
Train Hard, Fight Easy: Robust Meta Reinforcement Learningposter
Task-aware world model learning with meta weighting via bi-level optimizationposter
Video Prediction Models as Rewards for Reinforcement Learningposter
Synthetic Experience Replayposter
Policy Finetuning in Reinforcement Learning via Design of Experiments using Offline Dataposter
Learning Dynamic Attribute-factored World Models for Efficient Multi-object Reinforcement Learningposter
Learning World Models with Identifiable Factorizationposter
Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learningposter
Inverse Preference Learning: Preference-based RL without a Reward Functionposter
Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RLposter
Latent exploration for Reinforcement Learningposter
Large Language Models can Implement Policy Iterationposter
Generalized Weighted Path Consistency for Mastering Atari Gamesposter
Learning Environment-Aware Affordance for 3D Articulated Object Manipulation under Occlusionsposter
Accelerating Value Iteration with Anchoringposter
Reduced Policy Optimization for Continuous Control with Hard Constraintsposter
State Regularized Policy Optimization on Data with Dynamics Shiftposter
Offline Reinforcement Learning with Differential Privacyposter
Understanding and Addressing the Pitfalls of Bisimulation-based Representations in Offline Reinforcement Learningposter

ICLR24

| Paper | Type |
| --- | --- |
| Predictive auxiliary objectives in deep RL mimic learning in the brain | oral |
| Pre-Training Goal-based Models for Sample-Efficient Reinforcement Learning | oral |
| METRA: Scalable Unsupervised RL with Metric-Aware Abstraction | oral |
| ASID: Active Exploration for System Identification and Reconstruction in Robotic Manipulation | oral |
| Mastering Memory Tasks with World Models | oral |
| Generalized Policy Iteration using Tensor Approximation for Hybrid Control | spotlight |
| Selective Visual Representations Improve Convergence and Generalization for Embodied AI | spotlight |
| AMAGO: Scalable In-Context Reinforcement Learning for Adaptive Agents | spotlight |
| Confronting Reward Model Overoptimization with Constrained RLHF | spotlight |
| Proximal Policy Gradient Arborescence for Quality Diversity Reinforcement Learning | spotlight |
| Improving Offline RL by Blending Heuristics | spotlight |
| Decision ConvFormer: Local Filtering in MetaFormer is Sufficient for Decision Making | spotlight |
| Tool-Augmented Reward Modeling | spotlight |
| Reward-Consistent Dynamics Models are Strongly Generalizable for Offline Reinforcement Learning | spotlight |
| Dual RL: Unification and New Methods for Reinforcement and Imitation Learning | spotlight |
| Stabilizing Contrastive RL: Techniques for Robotic Goal Reaching from Offline Data | spotlight |
| Safe RLHF: Safe Reinforcement Learning from Human Feedback | spotlight |
| Cross$Q$: Batch Normalization in Deep Reinforcement Learning for Greater Sample Efficiency and Simplicity | spotlight |
| Blending Imitation and Reinforcement Learning for Robust Policy Improvement | spotlight |
| Unlocking the Power of Representations in Long-term Novelty-based Exploration | spotlight |
| Spatially-Aware Transformers for Embodied Agents | spotlight |
| Learning to Act without Actions | spotlight |
| Towards Principled Representation Learning from Videos for Reinforcement Learning | spotlight |
| TorchRL: A data-driven decision-making library for PyTorch | spotlight |
| Towards Robust Offline Reinforcement Learning under Diverse Data Corruption | spotlight |
| Learning Hierarchical World Models with Adaptive Temporal Abstractions from Discrete Latent Dynamics | spotlight |
| Text2Reward: Reward Shaping with Language Models for Reinforcement Learning | spotlight |
| Robotic Task Generalization via Hindsight Trajectory Sketches | spotlight |
| Submodular Reinforcement Learning | spotlight |
| Query-Policy Misalignment in Preference-Based Reinforcement Learning | spotlight |
| Kernel Metric Learning for In-Sample Off-Policy Evaluation of Deterministic RL Policies | spotlight |
| GenSim: Generating Robotic Simulation Tasks via Large Language Models | spotlight |
| Entity-Centric Reinforcement Learning for Object Manipulation from Pixels | spotlight |
| Illusory Attacks: Detectability Matters in Adversarial Attacks on Sequential Decision-Makers | spotlight |
| Addressing Signal Delay in Deep Reinforcement Learning | spotlight |
| DrM: Mastering Visual Reinforcement Learning through Dormant Ratio Minimization | spotlight |
| Task Adaptation from Skills: Information Geometry, Disentanglement, and New Objectives for Unsupervised Reinforcement Learning | spotlight |
| $\mathcal{B}$-Coder: On Value-Based Deep Reinforcement Learning for Program Synthesis | spotlight |
| Physics-Regulated Deep Reinforcement Learning: Invariant Embeddings | spotlight |
| Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization | spotlight |
| Learning to Act from Actionless Videos through Dense Correspondences | spotlight |
| CivRealm: A Learning and Reasoning Odyssey in Civilization for Decision-Making Agents | spotlight |
| TD-MPC2: Scalable, Robust World Models for Continuous Control | spotlight |
| Universal Humanoid Motion Representations for Physics-Based Control | spotlight |
| Adaptive Rational Activations to Boost Deep Reinforcement Learning | spotlight |
| Robust Adversarial Reinforcement Learning via Bounded Rationality Curricula | spotlight |
| Locality Sensitive Sparse Encoding for Learning World Models Online | poster |
| On Representation Complexity of Model-based and Model-free Reinforcement Learning | poster |
| Policy Rehearsing: Training Generalizable Policies for Reinforcement Learning | poster |
| What Matters to You? Towards Visual Representation Alignment for Robot Learning | poster |
| Improving Language Models with Advantage-based Offline Policy Gradients | poster |
| Training Diffusion Models with Reinforcement Learning | poster |
| The Trickle-down Impact of Reward Inconsistency on RLHF | poster |
| Maximum Entropy Model Correction in Reinforcement Learning | poster |
| Tree Search-Based Policy Optimization under Stochastic Execution Delay | poster |
| Offline RL with Observation Histories: Analyzing and Improving Sample Complexity | poster |
| Understanding Hidden Context in Preference Learning: Consequences for RLHF | poster |
| Eureka: Human-Level Reward Design via Coding Large Language Models | poster |
| Retrieval-Guided Reinforcement Learning for Boolean Circuit Minimization | poster |
| Score Models for Offline Goal-Conditioned Reinforcement Learning | poster |
| Contrastive Difference Predictive Coding | poster |
| Hindsight PRIORs for Reward Learning from Human Preferences | poster |
| Reward Model Ensembles Help Mitigate Overoptimization | poster |
| Safe Offline Reinforcement Learning with Feasibility-Guided Diffusion Model | poster |
| Compositional Conservatism: A Transductive Approach in Offline Reinforcement Learning | poster |
| Flow to Better: Offline Preference-based Reinforcement Learning via Preferred Trajectory Generation | poster |
| PAE: Reinforcement Learning from External Knowledge for Efficient Exploration | poster |
| Identifying Policy Gradient Subspaces | poster |
| PARL: A Unified Framework for Policy Alignment in Reinforcement Learning | poster |
| SafeDreamer: Safe Reinforcement Learning with World Models | poster |
| Vanishing Gradients in Reinforcement Finetuning of Language Models | poster |
| Goodhart's Law in Reinforcement Learning | poster |
| Score Regularized Policy Optimization through Diffusion Behavior | poster |
| Making RL with Preference-based Feedback Efficient via Randomization | poster |
| Consistency Models as a Rich and Efficient Policy Class for Reinforcement Learning | poster |
| Contrastive Preference Learning: Learning from Human Feedback without Reinforcement Learning | poster |
| Privileged Sensing Scaffolds Reinforcement Learning | poster |
| Learning Planning Abstractions from Language | poster |
| CrossLoco: Human Motion Driven Control of Legged Robots via Guided Unsupervised Reinforcement Learning | poster |
| Efficient Dynamics Modeling in Interactive Environments with Koopman Theory | poster |
| Jumanji: a Diverse Suite of Scalable Reinforcement Learning Environments in JAX | poster |
| Searching for High-Value Molecules Using Reinforcement Learning and Transformers | poster |
| Privately Aligning Language Models with Reinforcement Learning | poster |
| The HIM Solution for Legged Locomotion: Minimal Sensors, Efficient Learning, and Substantial Agility | poster |
| S$2$AC: Energy-Based Reinforcement Learning with Stein Soft Actor Critic | poster |
| Replay across Experiments: A Natural Extension of Off-Policy RL | poster |
| Piecewise Linear Parametrization of Policies: Towards Interpretable Deep Reinforcement Learning | poster |
| Time-Efficient Reinforcement Learning with Stochastic Stateful Policies | poster |
| Open the Black Box: Step-based Policy Updates for Temporally-Correlated Episodic Reinforcement Learning | poster |
| On Trajectory Augmentations for Off-Policy Evaluation | poster |
| Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation | poster |
| Understanding the Effects of RLHF on LLM Generalisation and Diversity | poster |
| Delphic Offline Reinforcement Learning under Nonidentifiable Hidden Confounding | poster |
| Prioritized Soft Q-Decomposition for Lexicographic Reinforcement Learning | poster |
| The Curse of Diversity in Ensemble-Based Exploration | poster |
| Off-Policy Primal-Dual Safe Reinforcement Learning | poster |
| STARC: A General Framework For Quantifying Differences Between Reward Functions | poster |
| Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning | poster |
| Discovering Temporally-Aware Reinforcement Learning Algorithms | poster |
| Revisiting Data Augmentation in Deep Reinforcement Learning | poster |
| Reward-Free Curricula for Training Robust World Models | poster |
| CPPO: Continual Learning for Reinforcement Learning with Human Feedback | poster |
| A Study of Generalization in Offline Reinforcement Learning | poster |
| RLIF: Interactive Imitation Learning as Reinforcement Learning | poster |
| Uncertainty-aware Constraint Inference in Inverse Constrained Reinforcement Learning | poster |
| Towards Imitation Learning to Branch for MIP: A Hybrid Reinforcement Learning based Sample Augmentation Approach | poster |
| Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation | poster |
| Uni-O4: Unifying Online and Offline Deep Reinforcement Learning with Multi-Step On-Policy Optimization | poster |
| Free from Bellman Completeness: Trajectory Stitching via Model-based Return-conditioned Supervised Learning | poster |
| Revisiting Plasticity in Visual Reinforcement Learning: Data, Modules and Training Stages | poster |
| Robot Fleet Learning via Policy Merging | poster |
| Improving Intrinsic Exploration by Creating Stationary Objectives | poster |
| Motif: Intrinsic Motivation from Artificial Intelligence Feedback | poster |
| Understanding when Dynamics-Invariant Data Augmentations Benefit Model-free Reinforcement Learning Updates | poster |
| RLCD: Reinforcement Learning from Contrastive Distillation for LM Alignment | poster |
| Reasoning with Latent Diffusion in Offline Reinforcement Learning | poster |
| Belief-Enriched Pessimistic Q-Learning against Adversarial State Perturbations | poster |
| Reward Design for Justifiable Sequential Decision-Making | poster |
| MAMBA: an Effective World Model Approach for Meta-Reinforcement Learning | poster |
| LOQA: Learning with Opponent Q-Learning Awareness | poster |
| Intelligent Switching for Reset-Free RL | poster |
| True Knowledge Comes from Practice: Aligning Large Language Models with Embodied Environments via Reinforcement Learning | poster |
| Skill Machines: Temporal Logic Skill Composition in Reinforcement Learning | poster |
| Uni-RLHF: Universal Platform and Benchmark Suite for Reinforcement Learning with Diverse Human Feedback | poster |
| Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning | poster |
| DittoGym: Learning to Control Soft Shape-Shifting Robots | poster |
| Decoupling regularization from the action space | poster |
| Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks | poster |
| Robust Model Based Reinforcement Learning Using $\mathcal{L}_1$ Adaptive Control | poster |
| DMBP: Diffusion model-based predictor for robust offline reinforcement learning against state observation perturbations | poster |
| LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents | poster |
| Integrating Planning and Deep Reinforcement Learning via Automatic Induction of Task Substructures | poster |
| Closing the Gap between TD Learning and Supervised Learning - A Generalisation Point of View. | poster |
| COPlanner: Plan to Roll Out Conservatively but to Explore Optimistically for Model-Based RL | poster |
| $\pi$2vec: Policy Representation with Successor Features | poster |
| Task Planning for Visual Room Rearrangement under Partial Observability | poster |
| DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing | poster |
| Meta Inverse Constrained Reinforcement Learning: Convergence Guarantee and Generalization Analysis | poster |
| Cleanba: A Reproducible and Efficient Distributed Reinforcement Learning Platform | poster |
| Consciousness-Inspired Spatio-Temporal Abstractions for Better Generalization in Reinforcement Learning | poster |
| Function-space Parameterization of Neural Networks for Sequential Learning | poster |
| When should we prefer Decision Transformers for Offline Reinforcement Learning? | poster |
| Bridging State and History Representations: Understanding Self-Predictive RL | poster |
| Embodied Active Defense: Leveraging Recurrent Feedback to Counter Adversarial Patches | poster |
| Stylized Offline Reinforcement Learning: Extracting Diverse High-Quality Behaviors from Heterogeneous Datasets | poster |
| Pre-training with Synthetic Data Helps Offline Reinforcement Learning | poster |
| Query-Dependent Prompt Evaluation and Optimization with Offline Inverse RL | poster |
| A Simple Solution for Offline Imitation from Observations and Examples with Possibly Incomplete Trajectories | poster |
| Offline Imitation Learning with Variational Counterfactual Reasoning | poster |
| Read and Reap the Rewards: Learning to Play Atari with the Help of Instruction Manuals | poster |
| Reinforcement Learning with Fast and Forgetful Memory | poster |
| Active Vision Reinforcement Learning under Limited Visual Observability | poster |
| Sequential Preference Ranking for Efficient Reinforcement Learning from Human Feedback | poster |
| Hierarchical Adaptive Value Estimation for Multi-modal Visual Reinforcement Learning | poster |
| Elastic Decision Transformer | poster |
| Importance-aware Co-teaching for Offline Model-based Optimization | poster |
| Parallel-mentoring for Offline Model-based Optimization | poster |
| Accountability in Offline Reinforcement Learning: Explaining Decisions with a Corpus of Examples | poster |

ICML24

| Paper | Type |
| --- | --- |
| Stop Regressing: Training Value Functions via Classification for Scalable Deep RL | oral |
| Position: Automatic Environment Shaping is the Next Frontier in RL | oral |
| ACE: Off-Policy Actor-Critic with Causality-Aware Entropy Regularization | oral |
| Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study | oral |
| SAPG: Split and Aggregate Policy Gradients | oral |
| Environment Design for Inverse Reinforcement Learning | oral |
| OMPO: A Unified Framework for RL under Policy and Dynamics Shifts | oral |
| Learning to Model the World With Language | oral |
| Offline Actor-Critic Reinforcement Learning Scales to Large Models | oral |
| Self-Composing Policies for Scalable Continual Reinforcement Learning | oral |
| Genie: Generative Interactive Environments | oral |
| Unsupervised Zero-Shot Reinforcement Learning via Functional Reward Encodings | spotlight |
| Craftax: A Lightning-Fast Benchmark for Open-Ended Reinforcement Learning | spotlight |
| Mixtures of Experts Unlock Parameter Scaling for Deep RL | spotlight |
| RICE: Breaking Through the Training Bottlenecks of Reinforcement Learning with Explanation | spotlight |
| Code as Reward: Empowering Reinforcement Learning with VLMs | spotlight |
| EfficientZero V2: Mastering Discrete and Continuous Control with Limited Data | spotlight |
| Behavior Generation with Latent Actions | spotlight |
| Overestimation, Overfitting, and Plasticity in Actor-Critic: the Bitter Lesson of Reinforcement Learning | poster |
| Hard Tasks First: Multi-Task Reinforcement Learning Through Task Scheduling | poster |
| Trustworthy Alignment of Retrieval-Augmented Large Language Models via Reinforcement Learning | poster |
| Meta-Learners for Partially-Identified Treatment Effects Across Multiple Environments | poster |
| How to Explore with Belief: State Entropy Maximization in POMDPs | poster |
| PIPER: Primitive-Informed Preference-based Hierarchical Reinforcement Learning via Hindsight Relabeling | poster |
| Iterative Regularized Policy Optimization with Imperfect Demonstrations | poster |
| Fourier Controller Networks for Real-Time Decision-Making in Embodied Learning | poster |
| Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning | poster |
| AD3: Implicit Action is the Key for World Models to Distinguish the Diverse Visual Distractors | poster |
| DRED: Zero-Shot Transfer in Reinforcement Learning via Data-Regularised Environment Design | poster |
| Adapting Pretrained ViTs with Convolution Injector for Visuo-Motor Control | poster |
| Degeneration-free Policy Optimization: RL Fine-Tuning for Language Models without Degeneration | poster |
| Energy-Guided Diffusion Sampling for Offline-to-Online Reinforcement Learning | poster |
| RVI-SAC: Average Reward Off-Policy Deep Reinforcement Learning | poster |
| Offline Transition Modeling via Contrastive Energy Learning | poster |
| Model-based Reinforcement Learning for Confounded POMDPs | poster |
| Revisiting Scalable Hessian Diagonal Approximations for Applications in Reinforcement Learning | poster |
| Absolute Policy Optimization: Enhancing Lower Probability Bound of Performance with High Confidence | poster |
| Meta-Reinforcement Learning Robust to Distributional Shift Via Performing Lifelong In-Context Learning | poster |
| DIDI: Diffusion-Guided Diversity for Offline Behavioral Generation | poster |
| When Do Skills Help Reinforcement Learning? A Theoretical Analysis of Temporal Abstractions | poster |
| BeigeMaps: Behavioral Eigenmaps for Reinforcement Learning from Images | poster |
| Physics-Informed Neural Network Policy Iteration: Algorithms, Convergence, and Verification | poster |
| RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback | poster |
| RoboDreamer: Learning Compositional World Models for Robot Imagination | poster |
| Investigating Pre-Training Objectives for Generalization in Vision-Based Reinforcement Learning | poster |
| RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation | poster |
| Coprocessor Actor Critic: A Model-Based Reinforcement Learning Approach For Adaptive Brain Stimulation | poster |
| GFlowNet Training by Policy Gradients | poster |
| Value-Evolutionary-Based Reinforcement Learning | poster |
| PEARL: Zero-shot Cross-task Preference Alignment and Robust Reward Learning for Robotic Manipulation | poster |
| Feasibility Consistent Representation Learning for Safe Reinforcement Learning | poster |
| Distilling Morphology-Conditioned Hypernetworks for Efficient Universal Morphology Control | poster |
| Constrained Reinforcement Learning Under Model Mismatch | poster |
| Discovering Multiple Solutions from a Single Task in Offline Reinforcement Learning | poster |
| Learning to Stabilize Online Reinforcement Learning in Unbounded State Spaces | poster |
| Learning to Play Atari in a World of Tokens | poster |
| Breaking the Barrier: Enhanced Utility and Robustness in Smoothed DRL Agents | poster |
| Probabilistic Constrained Reinforcement Learning with Formal Interpretability | poster |
| Hieros: Hierarchical Imagination on Structured State Space Sequence World Models | poster |
| Random Latent Exploration for Deep Reinforcement Learning | poster |
| Model-based Reinforcement Learning for Parameterized Action Spaces | poster |
| Confidence Aware Inverse Constrained Reinforcement Learning | poster |
| Averaging n-step Returns Reduces Variance in Reinforcement Learning | poster |
| Position: A Call for Embodied AI | poster |
| Just Cluster It: An Approach for Exploration in High-Dimensions using Clustering and Pre-Trained Representations | poster |
| The Max-Min Formulation of Multi-Objective Reinforcement Learning: From Theory to a Model-Free Algorithm | poster |
| Skill Set Optimization: Reinforcing Language Model Behavior via Transferable Skills | poster |
| Offline-Boosted Actor-Critic: Adaptively Blending Optimal Historical Behaviors in Deep Off-Policy RL | poster |
| Sequence Compression Speeds Up Credit Assignment in Reinforcement Learning | poster |
| Seizing Serendipity: Exploiting the Value of Past Success in Off-Policy Actor-Critic | poster |
| Generalization to New Sequential Decision Making Tasks with In-Context Learning | poster |
| Simple Ingredients for Offline Reinforcement Learning | poster |
| Efficient World Models with Context-Aware Tokenization | poster |
| In value-based deep reinforcement learning, a pruned network is a good network | poster |
| Probabilistic Subgoal Representations for Hierarchical Reinforcement Learning | poster |
| Premier-TACO is a Few-Shot Policy Learner: Pretraining Multitask Representation via Temporal Action-Driven Contrastive Loss | poster |
| Understanding and Diagnosing Deep Reinforcement Learning | poster |
| To the Max: Reinventing Reward in Reinforcement Learning | poster |
| ReLU to the Rescue: Improve Your On-Policy Actor-Critic with Positive Advantages | poster |
| Stochastic Q-learning for Large Discrete Action Spaces | poster |
| Feasible Reachable Policy Iteration | poster |
| Position: Video as the New Language for Real-World Decision Making | poster |
| Learning Latent Dynamic Robust Representations for World Models | poster |
| Reinformer: Max-Return Sequence Modeling for Offline RL | poster |
| Rethinking Transformers in Solving POMDPs | poster |
| Single-Trajectory Distributionally Robust Reinforcement Learning | poster |
| Trust the Model Where It Trusts Itself - Model-Based Actor-Critic with Uncertainty-Aware Rollout Adaption | poster |
| A Minimaximalist Approach to Reinforcement Learning from Human Feedback | poster |
| EvoRainbow: Combining Improvements in Evolutionary Reinforcement Learning for Policy Search | poster |
| SeMOPO: Learning High-quality Model and Policy from Low-quality Offline Visual Datasets | poster |
| Adaptive-Gradient Policy Optimization: Enhancing Policy Learning in Non-Smooth Differentiable Simulations | poster |
| Dense Reward for Free in Reinforcement Learning from Human Feedback | poster |
| Configurable Mirror Descent: Towards a Unification of Decision Making | poster |
| Policy Learning for Balancing Short-Term and Long-Term Rewards | poster |
| Reward Shaping for Reinforcement Learning with An Assistant Reward Agent | poster |
| Distributional Bellman Operators over Mean Embeddings | poster |
| SiT: Symmetry-invariant Transformers for Generalisation in Reinforcement Learning | poster |
| Geometric Active Exploration in Markov Decision Processes: the Benefit of Abstraction | poster |
| Learning a Diffusion Model Policy from Rewards via Q-Score Matching | poster |
| ACPO: A Policy Optimization Algorithm for Average MDPs with Constraints | poster |
| Position: Benchmarking is Limited in Reinforcement Learning Research | poster |
| Learning Constraints from Offline Demonstrations via Superior Distribution Correction Estimation | poster |
| Augmenting Decision with Hypothesis in Reinforcement Learning | poster |
| SHINE: Shielding Backdoors in Deep Reinforcement Learning | poster |
| Learning Coverage Paths in Unknown Environments with Deep Reinforcement Learning | poster |
| Improving Token-Based World Models with Parallel Observation Prediction | poster |
| Learning to Explore in POMDPs with Informational Rewards | poster |
| Stealthy Imitation: Reward-guided Environment-free Policy Stealing | poster |
| FuRL: Visual-Language Models as Fuzzy Rewards for Reinforcement Learning | poster |
| Enhancing Value Function Estimation through First-Order State-Action Dynamics in Offline Reinforcement Learning | poster |
| In-Context Reinforcement Learning for Variable Action Spaces | poster |
| Information-Directed Pessimism for Offline Reinforcement Learning | poster |
| PcLast: Discovering Plannable Continuous Latent States | poster |
| Bayesian Design Principles for Offline-to-Online Reinforcement Learning | poster |
| Adaptive Advantage-Guided Policy Regularization for Offline Reinforcement Learning | poster |
| ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL | poster |
| RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback | poster |
| Langevin Policy for Safe Reinforcement Learning | poster |
| Reflective Policy Optimization | poster |
| Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint | poster |
| Contrastive Representation for Data Filtering in Cross-Domain Offline Reinforcement Learning | poster |
| Position: Foundation Agents as the Paradigm Shift for Decision Making | poster |
| Principled Penalty-based Methods for Bilevel Reinforcement Learning and RLHF | poster |
| Do Transformer World Models Give Better Policy Gradients? | poster |
| Boosting Reinforcement Learning with Strongly Delayed Feedback Through Auxiliary Short Delays | poster |
| Zero-Shot Reinforcement Learning via Function Encoders | poster |
| 3D-VLA: A 3D Vision-Language-Action Generative World Model | poster |
| SF-DQN: Provable Knowledge Transfer using Successor Feature for Deep Reinforcement Learning | poster |
| In-Context Decision Transformer: Reinforcement Learning via Hierarchical Chain-of-Thought | poster |
| Quality-Diversity Actor-Critic: Learning High-Performing and Diverse Behaviors via Value and Successor Features Critics | poster |
| Listwise Reward Estimation for Offline Preference-based Reinforcement Learning | poster |
| Position: Scaling Simulation is Neither Necessary Nor Sufficient for In-the-Wild Robot Manipulation | poster |
| Hybrid Reinforcement Learning from Offline Observation Alone | poster |
| Is Inverse Reinforcement Learning Harder than Standard Reinforcement Learning? A Theoretical Perspective | poster |
| Regularized Q-learning through Robust Averaging | poster |
| Cross-Domain Policy Adaptation by Capturing Representation Mismatch | poster |
| HarmoDT: Harmony Multi-Task Decision Transformer for Offline Reinforcement Learning | poster |
| Foundation Policies with Hilbert Representations | poster |
| Subequivariant Reinforcement Learning in 3D Multi-Entity Physical Environments | poster |
| LLM-Empowered State Representation for Reinforcement Learning | poster |
| Prompt-based Visual Alignment for Zero-shot Policy Transfer | poster |
| An Embodied Generalist Agent in 3D World | poster |
| Q-value Regularized Transformer for Offline Reinforcement Learning | poster |
| Highway Value Iteration Networks | poster |
| Robust Inverse Constrained Reinforcement Learning under Model Misspecification | poster |
| Exploration and Anti-Exploration with Distributional Random Network Distillation | poster |
| Policy-conditioned Environment Models are More Generalizable | poster |
| Constrained Ensemble Exploration for Unsupervised Skill Discovery | poster |
| DiffStitch: Boosting Offline Reinforcement Learning with Diffusion-based Trajectory Stitching | poster |
| Rethinking Decision Transformer via Hierarchical Reinforcement Learning | poster |
| Learning Cognitive Maps from Transformer Representations for Efficient Planning in Partially Observed Environments | poster |
| HarmonyDream: Task Harmonization Inside World Models | poster |
| Advancing DRL Agents in Commercial Fighting Games: Training, Integration, and Agent-Human Alignment | poster |
| Offline Imitation from Observation via Primal Wasserstein State Occupancy Matching | poster |
| Fine-Grained Causal Dynamics Learning with Quantization for Improving Robustness in Reinforcement Learning | poster |
| Switching the Loss Reduces the Cost in Batch Reinforcement Learning | poster |
| Think Before You Act: Decision Transformers with Working Memory | poster |

NeurIPS24

| Paper | Type |
| --- | --- |
| Maximum Entropy Inverse Reinforcement Learning of Diffusion Models with Energy-Based Models | oral |
| Improving Environment Novelty Quantification for Effective Unsupervised Environment Design | oral |
| RL-GPT: Integrating Reinforcement Learning and Code-as-policy | oral |
| Optimizing Automatic Differentiation with Deep Reinforcement Learning | spotlight |
| Bigger, Regularized, Optimistic: scaling for compute and sample efficient continuous control | spotlight |
| Can Learned Optimization Make Reinforcement Learning Less Difficult? | spotlight |
| Goal Reduction with Loop-Removal Accelerates RL and Models Human Brain Activity in Goal-Directed Learning | spotlight |
| BricksRL: A Platform for Democratizing Robotics and Reinforcement Learning Research and Education with LEGO | spotlight |
| Humanoid Locomotion as Next Token Prediction | spotlight |
| Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning | spotlight |
| Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control | spotlight |
| A Study of Plasticity Loss in On-Policy Deep Reinforcement Learning | spotlight |
| Diffusion for World Modeling: Visual Details Matter in Atari | spotlight |
| Exclusively Penalized Q-learning for Offline Reinforcement Learning | spotlight |
| DiffTORI: Differentiable Trajectory Optimization for Deep Reinforcement and Imitation Learning | spotlight |
| Variational Delayed Policy Optimization | spotlight |
| Rethinking Exploration in Reinforcement Learning with Effective Metric-Based Exploration Bonus | spotlight |
| Towards an Information Theoretic Framework of Context-Based Offline Meta-Reinforcement Learning | spotlight |
| Reinforcement Learning Gradients as Vitamin for Online Finetuning Decision Transformers | spotlight |
| The Value of Reward Lookahead in Reinforcement Learning | spotlight |
| PEAC: Unsupervised Pre-training for Cross-Embodiment Reinforcement Learning | poster |
| Reward Machines for Deep RL in Noisy and Uncertain Environments | poster |
| Provable Partially Observable Reinforcement Learning with Privileged Information | poster |
| Artificial Generational Intelligence: Cultural Accumulation in Reinforcement Learning | poster |
| SimPO: Simple Preference Optimization with a Reference-Free Reward | poster |
| Subwords as Skills: Tokenization for Sparse-Reward Reinforcement Learning | poster |
| Model-based Diffusion for Trajectory Optimization | poster |
| Operator World Models for Reinforcement Learning | poster |
| Foundations of Multivariate Distributional Reinforcement Learning | poster |
| Imitating Language via Scalable Inverse Reinforcement Learning | poster |
| Beyond Optimism: Exploration With Partially Observable Rewards | poster |
| SleeperNets: Universal Backdoor Poisoning Attacks Against Reinforcement Learning Agents | poster |
| Learning World Models for Unconstrained Goal Navigation | poster |
| Exploring the Edges of Latent State Clusters for Goal-Conditioned Reinforcement Learning | poster |
| Off-Dynamics Reinforcement Learning via Domain Adaptation and Reward Augmented Imitation | poster |
| Constrained Latent Action Policies for Model-Based Offline Reinforcement Learning | poster |
| Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback | poster |
| Rethinking Inverse Reinforcement Learning: from Data Alignment to Task Alignment | poster |
| Normalization and effective learning rates in reinforcement learning | poster |
| ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search | poster |
| Deep Policy Gradient Methods Without Batch Updates, Target Networks, or Replay Buffers | poster |
| Text-Aware Diffusion for Policy Learning | poster |
| A Tractable Inference Perspective of Offline RL | poster |
| Reinforcing LLM Agents via Policy Optimization with Action Decomposition | poster |
| Parseval Regularization for Continual Reinforcement Learning | poster |
| The Surprising Ineffectiveness of Pre-Trained Visual Representations for Model-Based Reinforcement Learning | poster |
| Speculative Monte-Carlo Tree Search | poster |
| Safety through feedback in Constrained RL | poster |
| Test Where Decisions Matter: Importance-driven Testing for Deep Reinforcement Learning | poster |
| Skill-aware Mutual Information Optimisation for Zero-shot Generalisation in Reinforcement Learning | poster |
| Entropy-regularized Diffusion Policy with Q-Ensembles for Offline Reinforcement Learning | poster |
| An Analytical Study of Utility Functions in Multi-Objective Reinforcement Learning | poster |
| Diffusion-DICE: In-Sample Diffusion Guidance for Offline Reinforcement Learning | poster |
| Efficient Recurrent Off-Policy RL Requires a Context-Encoder-Specific Learning Rate | poster |
| Uncertainty-based Offline Variational Bayesian Reinforcement Learning for Robustness under Diverse Data Corruptions | poster |
| Any2Policy: Learning Visuomotor Policy with Any-Modality | poster |
| Reinforcement Learning with Adaptive Regularization for Safe Control of Critical Systems | poster |
| Adam on Local Time: Addressing Nonstationarity in RL with Relative Adam Timesteps | poster |
| ROIDICE: Offline Return on Investment Maximization for Efficient Decision Making | poster |
| Prediction with Action: Visual Policy Learning via Joint Denoising Process | poster |
Aligning Diffusion Behaviors with Q-functions for Efficient Continuous Controlposter

ICLR25

| Paper | Type |
| --- | --- |
| RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style | oral |
| Diffusion-Based Planning for Autonomous Driving with Flexible Guidance | oral |
| Learning to Search from Demonstration Sequences | oral |
| Spread Preference Annotation: Direct Preference Judgment for Efficient LLM Alignment | oral |
| Interpreting Emergent Planning in Model-Free Reinforcement Learning | oral |
| Kinetix: Investigating the Training of General Agents through Open-Ended Physics-Based Control Tasks | oral |
| OptionZero: Planning with Learned Options | oral |
| Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation | oral |
| Data Scaling Laws in Imitation Learning for Robotic Manipulation | oral |
| More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness | oral |
| Geometry-aware RL for Manipulation of Varying Shapes and Deformable Objects | oral |
| DeepLTL: Learning to Efficiently Satisfy Complex LTL Specifications for Multi-Task RL | oral |
| Training Language Models to Self-Correct via Reinforcement Learning | oral |
| Prioritized Generative Replay | oral |
| Flat Reward in Policy Parameter Space Implies Robust Reinforcement Learning | oral |
| Open-World Reinforcement Learning over Long Short-Term Imagination | oral |
| Online Preference Alignment for Language Models via Count-based Exploration | spotlight |
| Joint Reward and Policy Learning with Demonstrations and Human Feedback Improves Alignment | spotlight |
| Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking | spotlight |
| Online Reinforcement Learning in Non-Stationary Context-Driven Environments | spotlight |
| DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback | spotlight |
| Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization | spotlight |
| TOP-ERL: Transformer-based Off-Policy Episodic Reinforcement Learning | spotlight |
| VisualPredicator: Learning Abstract World Models with Neuro-Symbolic Predicates for Robot Planning | spotlight |
| Multi-Robot Motion Planning with Diffusion Models | spotlight |
| Simplifying Deep Temporal Difference Learning | spotlight |
| ODE-based Smoothing Neural Network for Reinforcement Learning Tasks | spotlight |
| MAD-TD: Model-Augmented Data stabilizes High Update Ratio RL | spotlight |
| Stabilizing Reinforcement Learning in Differentiable Multiphysics Simulation | spotlight |
| Don't flatten, tokenize! Unlocking the key to SoftMoE's efficacy in deep RL | spotlight |
| Learning Transformer-based World Models with Contrastive Predictive Coding | spotlight |
| Towards General-Purpose Model-Free Reinforcement Learning | spotlight |
| Rethinking Reward Model Evaluation: Are We Barking up the Wrong Tree? | spotlight |
| Accelerating Goal-Conditioned Reinforcement Learning Algorithms and Research | spotlight |
| SimBa: Simplicity Bias for Scaling Up Parameters in Deep Reinforcement Learning | spotlight |
| Test-time Alignment of Diffusion Models without Reward Over-optimization | spotlight |
| Mitigating Information Loss in Tree-Based Reinforcement Learning via Direct Optimization | spotlight |
| What Makes a Good Diffusion Planner for Decision Making? | spotlight |
| ADAM: An Embodied Causal Agent in Open-World Environments | poster |
| How to Evaluate Reward Models for RLHF | poster |
| SafeDiffuser: Safe Planning with Diffusion Probabilistic Models | poster |
| Efficient Online Reinforcement Learning Fine-Tuning Need Not Retain Offline Data | poster |
| Efficient Reinforcement Learning with Large Language Model Priors | poster |
| Langevin Soft Actor-Critic: Efficient Exploration through Uncertainty-Driven Critic Learning | poster |
| Efficient Policy Evaluation with Safety Constraint for Reinforcement Learning | poster |
| Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity | poster |
| Safety-Prioritizing Curricula for Constrained Reinforcement Learning | poster |
| Neural Stochastic Differential Equations for Uncertainty-Aware Offline RL | poster |
| MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization | poster |
| SEMDICE: Off-policy State Entropy Maximization via Stationary Distribution Correction Estimation | poster |
| Strategist: Self-improvement of LLM Decision Making via Bi-Level Tree Search | poster |

ICML25

| Paper | Type |
| --- | --- |
| EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents | oral |
| Network Sparsity Unlocks the Scaling Potential of Deep Reinforcement Learning | oral |
| Multi-Turn Code Generation Through Single-Step Rewards | spotlight |
| Policy-labeled Preference Learning: Is Preference Enough for RLHF? | spotlight |
| Monte Carlo Tree Diffusion for System 2 Planning | spotlight |
| RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning | spotlight |
| Decision Making under the Exponential Family: Distributionally Robust Optimisation with Bayesian Ambiguity Sets | spotlight |
| Hyperspherical Normalization for Scalable Deep Reinforcement Learning | spotlight |
| The Synergy of LLMs & RL Unlocks Offline Learning of Generalizable Language-Conditioned Policies with Low-fidelity Data | spotlight |
| Penalizing Infeasible Actions and Reward Scaling in Reinforcement Learning with Offline Data | spotlight |
| A Unified Theoretical Analysis of Private and Robust Offline Alignment: from RLHF to DPO | spotlight |
| Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations | spotlight |
| Continual Reinforcement Learning by Planning with Online World Models | spotlight |
| Policy Regularization on Globally Accessible States in Cross-Dynamics Reinforcement Learning | spotlight |
| Latent Diffusion Planning for Imitation Learning | spotlight |
| Novelty Detection in Reinforcement Learning with World Models | spotlight |
| DPO Meets PPO: Reinforced Token Optimization for RLHF | spotlight |

NeurIPS25

| Paper | Type |
| --- | --- |
| State Entropy Regularization for Robust Reinforcement Learning | oral |
| PRIMT: Preference-based Reinforcement Learning with Multimodal Feedback and Trajectory Synthesis from Foundation Models | oral |
| A Clean Slate for Offline Reinforcement Learning | oral |
| 1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities | oral |
| QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training | oral |
| Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? | oral |
| AceSearcher: Bootstrapping Reasoning and Search for LLMs via Reinforced Self-Play | spotlight |
| Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems | spotlight |
| Forecasting in Offline Reinforcement Learning for Non-stationary Environments | spotlight |
| Counteractive RL: Rethinking Core Principles for Efficient and Scalable Deep Reinforcement Learning | spotlight |
| Reverse Engineering Human Preferences with Reinforcement Learning | spotlight |
| SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning | spotlight |
| d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning | spotlight |
| SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation | spotlight |
| Memo: Training Memory-Efficient Embodied Agents with Reinforcement Learning | spotlight |
| DeepDiver: Adaptive Web-Search Intensity Scaling via Reinforcement Learning | spotlight |
| DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models | spotlight |
| Composite Flow Matching for Reinforcement Learning with Shifted-Dynamics Data | spotlight |
| Reinforcement Learning with Imperfect Transition Predictions: A Bellman-Jensen Approach | spotlight |
| Stable Gradients for Stable Learning at Scale in Deep Reinforcement Learning | spotlight |
| AlphaZero Neural Scaling and Zipf's Law: a Tale of Board Games and Power Laws | spotlight |
| DAPO: Improving Multi-Step Reasoning Abilities of Large Language Models with Direct Advantage-Based Policy Optimization | spotlight |
| VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning | spotlight |
| CURE: Co-Evolving Coders and Unit Testers via Reinforcement Learning | spotlight |
| To Distill or Decide? Understanding the Algorithmic Trade-off in Partially Observable RL | spotlight |
| Adaptive Neighborhood-Constrained Q Learning for Offline Reinforcement Learning | spotlight |
| To Think or Not To Think: A Study of Thinking in Rule-Based Visual Reinforcement Fine-Tuning | spotlight |
| DexGarmentLab: Dexterous Garment Manipulation Environment with Generalizable Policy | spotlight |
| Q-Insight: Understanding Image Quality via Visual Reinforcement Learning | spotlight |
| Novel Exploration via Orthogonality | poster |
| Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning | poster |
| Reinforcement Learning with Backtracking Feedback | poster |
| Safe RLHF-V: Safe Reinforcement Learning from Multi-modal Human Feedback | poster |
| World-aware Planning Narratives Enhance Large Vision-Language Model Planner | poster |
| STAR: Efficient Preference-based Reinforcement Learning via Dual Regularization | poster |
| FairDICE: Fairness-Driven Offline Multi-Objective Reinforcement Learning | poster |
| Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics | poster |
| On Evaluating Policies for Robust POMDPs | poster |
| Periodic Skill Discovery | poster |
| Reinforcement Learning with Action Chunking | poster |
| ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning | poster |
| UFO-RL: Uncertainty-Focused Optimization for Efficient Reinforcement Learning Data Selection | poster |
| Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models | poster |
| DISCOVER: Automated Curricula for Sparse-Reward Reinforcement Learning | poster |
| EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation | poster |
| Dynamics-Aligned Latent Imagination in Contextual World Models for Zero-Shot Generalization | poster |
| Off-policy Reinforcement Learning with Model-based Exploration Augmentation | poster |
| IOSTOM: Offline Imitation Learning from Observations via State Transition Occupancy Matching | poster |
| Tree-Guided Diffusion Planner | poster |
| Real-Time Execution of Action Chunking Flow Policies | poster |
| Prompted Policy Search: Reinforcement Learning through Linguistic and Numerical Reasoning in LLMs | poster |
| Behavior Injection: Preparing Language Models for Reinforcement Learning | poster |
| ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning | poster |
| Coarse-to-fine Q-Network with Action Sequence for Data-Efficient Reinforcement Learning | poster |
| Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning | poster |
| Deep RL Needs Deep Behavior Analysis: Exploring Implicit Planning by Model-Free Agents in Open-Ended Environments | poster |

ICLR26

| Paper | Type |
| --- | --- |
| Exploratory Diffusion Model for Unsupervised Reinforcement Learning | oral |
| Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search | oral |
| Why DPO is a Misspecified Estimator and How to Fix It | oral |
| SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety | oral |
| Compositional Diffusion with Guided search for Long-Horizon Planning | oral |
| LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts | oral |
| GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning | oral |
| Reasoning without Training: Your Base Model is Smarter Than You Think | oral |
| Rodrigues Network for Learning Robot Actions | oral |
| Mean Flow Policy with Instantaneous Velocity Constraint for One-step Action Generation | oral |
| TD-JEPA: Latent-predictive Representations for Zero-Shot Reinforcement Learning | oral |
| WoW!: World Models in a Closed-Loop World | oral |
| DiffusionNFT: Online Diffusion Reinforcement with Forward Process | oral |
| Mastering Sparse CUDA Generation through Pretrained Models and Deep Reinforcement Learning | oral |
| LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning | oral |
| MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Models for Embodied Task Planning | oral |

About

Related papers for reinforcement learning, including classic papers and the latest papers from top conferences.
Topics: generalization-reinforcement-learning, iclr23, iclr24, iclr25, icml22, icml23, icml24, icml25, meta-reinforcement-learning, model-based-rl, model-free-rl, neurips22, neurips23, neurips24, neurips25, offline-rl, reinforcement-learning, reinforcement-learning-papers, unsupervised-reinforcement-learning
