VOCE: Variational Optimization with Conservative Estimation for Offline Safe Reinforcement Learning
Jiayi Guan,
Guang Chen*, Jiaming Ji, Long Yang, Ao Zhou, Zhijun Li, Changjun Jiang
Advances in Neural Information Processing Systems 36 (NeurIPS 2023)
PDF /
code /
project /
bibtex
@inproceedings{NEURIPS2023_6a7c2a32,
author = {Guan, Jiayi and Chen, Guang and Ji, Jiaming and Yang, Long and Zhou, Ao and Li, Zhijun and Jiang, Changjun},
booktitle = {Advances in Neural Information Processing Systems},
editor = {A. Oh and T. Naumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
pages = {33758--33780},
publisher = {Curran Associates, Inc.},
title = {VOCE: Variational Optimization with Conservative Estimation for Offline Safe Reinforcement Learning},
url = {https://proceedings.neurips.cc/paper_files/paper/2023/file/6a7c2a320f5f36bb98f8eb878c6f1180-Paper-Conference.pdf},
volume = {36},
year = {2023}
}
We propose a Variational Optimization with Conservative Estimation algorithm (VOCE) to solve the problem of optimizing safe policies from offline datasets. Concretely, we reframe offline safe RL as a problem of probabilistic inference, introducing variational distributions that make policy optimization more flexible. Subsequently, we employ pessimistic estimation methods to estimate the Q-values of cost and reward, which mitigates the extrapolation errors induced by out-of-distribution (OOD) actions. Finally, extensive experiments demonstrate that VOCE achieves competitive performance across multiple tasks, particularly outperforming state-of-the-art algorithms in terms of safety.
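To illustrate the conservative-estimation idea described in the abstract, the minimal Python sketch below applies a CQL-style penalty that under-estimates the reward Q-value and over-estimates the cost Q-value on actions drawn from the current policy, which may be out-of-distribution. This is a sketch under stated assumptions, not the paper's implementation: the QNet architecture, the exact penalty form, and the alpha coefficient are all illustrative choices.

# Illustrative sketch of conservative (pessimistic) Q-estimation for offline
# safe RL. Names (QNet, conservative_q_losses, alpha) and the penalty form are
# assumptions for this example, not the VOCE implementation.
import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def conservative_q_losses(q_r, q_c, batch, policy, gamma=0.99, alpha=1.0):
    """Bellman losses plus conservative penalties: push the reward Q down and
    the cost Q up on actions sampled from the current policy (potentially
    out-of-distribution), relative to actions observed in the dataset."""
    obs, act, rew, cost, next_obs, done = batch
    with torch.no_grad():
        next_act = policy(next_obs)  # actions proposed by the current policy
        target_r = rew + gamma * (1 - done) * q_r(next_obs, next_act)
        target_c = cost + gamma * (1 - done) * q_c(next_obs, next_act)

    # Standard temporal-difference terms on dataset transitions.
    bellman_r = ((q_r(obs, act) - target_r) ** 2).mean()
    bellman_c = ((q_c(obs, act) - target_c) ** 2).mean()

    pi_act = policy(obs)  # possibly OOD actions
    # Pessimism: under-estimate reward and over-estimate cost on OOD actions.
    penalty_r = (q_r(obs, pi_act) - q_r(obs, act)).mean()
    penalty_c = (q_c(obs, act) - q_c(obs, pi_act)).mean()

    return bellman_r + alpha * penalty_r, bellman_c + alpha * penalty_c

In this sketch the two penalties are symmetric but act in opposite directions, reflecting that safety requires being cautious about both over-valuing rewards and under-valuing costs for actions unsupported by the data.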