Offline reinforcement learning (RL) focuses on learning policies from static datasets without further environment interaction. Since distributional reinforcement learning was introduced into the offline setting, existing methods have become effective at quantifying risk and ensuring the safety of learned policies. However, these algorithms cannot effectively balance distribution shift against robustness, and even a minor perturbation of the observations can significantly impair policy performance. In this paper, we propose Offline Robustness of Distributional actor-critic Ensemble Reinforcement Learning (ORDER) to improve the robustness of learned policies. ORDER enhances robustness in two ways: 1) it applies a smoothing technique to the policy and the distribution functions at states near the dataset; 2) it strengthens the quantile network with an ensemble. Beyond improving robustness, we also prove theoretically that ORDER converges to a conservative lower bound, which alleviates distribution shift. Experiments on the D4RL benchmark, including comparisons with prior methods and ablation studies, validate the effectiveness of ORDER.
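As a rough illustration of the first idea, the sketch below shows one possible form of the smoothing regularizer in PyTorch-style Python: dataset states are perturbed within a small ball, and the regularizer penalizes how much the policy's actions and the quantile network's return-distribution estimates change under the perturbation. The module interfaces (policy, quantile_net and its (state, action, tau) call signature), epsilon, and n_quantiles are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn as nn

def smoothing_loss(policy: nn.Module,
                   quantile_net: nn.Module,
                   states: torch.Tensor,
                   epsilon: float = 0.01,
                   n_quantiles: int = 32) -> torch.Tensor:
    # Penalize changes in the policy and in the estimated return
    # distribution when dataset states are slightly perturbed.
    noise = epsilon * torch.randn_like(states)        # states near the dataset
    perturbed = states + noise

    # Policy smoothness: actions should change little under perturbation.
    actions = policy(states)
    actions_pert = policy(perturbed)
    policy_term = ((actions - actions_pert) ** 2).mean()

    # Distributional smoothness: quantile estimates of the return
    # distribution should also change little.
    taus = torch.rand(states.shape[0], n_quantiles)   # random quantile fractions
    z = quantile_net(states, actions.detach(), taus)
    z_pert = quantile_net(perturbed, actions.detach(), taus)
    quantile_term = ((z - z_pert) ** 2).mean()

    return policy_term + quantile_term

In practice, a term of this kind would be added with some weight to the actor and critic objectives of a distributional actor-critic, and the second idea in the abstract (strengthening the quantile network, e.g. with an ensemble of such critics) would supply the distribution estimates used here.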