Batch reinforcement learning with hyperparameter gradients

Byung Jun Lee, Jongmin Lee, Peter Vrancx, Dongho Kim, Kee Eung Kim

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


We consider the batch reinforcement learning problem, where the agent must learn only from a fixed batch of data, without further interaction with the environment. In this setting, we want to prevent the optimized policy from deviating too much from the data collection policy, since estimation otherwise becomes highly unstable due to the off-policy nature of the problem. However, imposing this requirement too strongly yields a policy that merely follows the data collection policy. Unlike prior work, where this trade-off is controlled by hand-tuned hyperparameters, we propose a novel batch reinforcement learning approach, batch optimization of policy and hyperparameter (BOPAH), that optimizes the hyperparameter by gradient descent using held-out data. We show that BOPAH outperforms other batch reinforcement learning algorithms on tabular and continuous control tasks by finding a good balance in the trade-off between adhering to the data collection policy and pursuing possible policy improvement.
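The core mechanism described above, tuning a regularization hyperparameter by the gradient of a held-out objective rather than by hand, can be illustrated outside reinforcement learning. The sketch below is a generic analogue using ridge regression, not the paper's algorithm: the "inner" problem is solved on the batch data, and the regularization strength `lam` (playing the role of the trade-off hyperparameter) is updated with the exact gradient of the held-out loss. All names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data, split into a fixed "batch" set and a held-out set.
X_tr, X_val = rng.normal(size=(80, 5)), rng.normal(size=(40, 5))
w_true = rng.normal(size=5)
y_tr = X_tr @ w_true + rng.normal(scale=0.5, size=80)
y_val = X_val @ w_true + rng.normal(scale=0.5, size=40)


def fit(lam):
    """Inner problem on the batch: w(lam) = (X'X + lam*I)^-1 X'y."""
    A = X_tr.T @ X_tr + lam * np.eye(X_tr.shape[1])
    return np.linalg.solve(A, X_tr.T @ y_tr), A


def heldout_loss(lam):
    """Outer objective: squared error of w(lam) on held-out data."""
    w, _ = fit(lam)
    r = X_val @ w - y_val
    return r @ r


# Hyperparameter gradient: dw/dlam = -A^-1 w (differentiate the
# normal equations), so dL/dlam = 2 r' X_val (dw/dlam).
lam = 10.0
for _ in range(500):
    w, A = fit(lam)
    r = X_val @ w - y_val
    dw = -np.linalg.solve(A, w)
    grad = 2.0 * r @ (X_val @ dw)
    lam = max(lam - 0.1 * grad, 1e-6)  # gradient step; keep lam positive
```

In BOPAH the inner problem is policy optimization regularized toward the data collection policy, and the outer gradient flows through the policy into the trade-off hyperparameter; the bilevel structure, however, is the same as in this toy.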

Original language: English
Title of host publication: 37th International Conference on Machine Learning, ICML 2020
Editors: Hal Daume, Aarti Singh
Publisher: International Machine Learning Society (IMLS)
Number of pages: 11
ISBN (Electronic): 9781713821120
Publication status: Published - 2020
Externally published: Yes
Event: 37th International Conference on Machine Learning, ICML 2020 - Virtual, Online
Duration: 2020 Jul 13 - 2020 Jul 18

Publication series

Name: 37th International Conference on Machine Learning, ICML 2020


Conference: 37th International Conference on Machine Learning, ICML 2020
City: Virtual, Online

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Human-Computer Interaction
  • Software

