
Boundedness of iterates in \(Q\)-learning. (English) Zbl 1129.93552

Summary: Reinforcement Learning (RL) is a simulation-based counterpart of stochastic dynamic programming. In recent years, it has been used to solve complex Markov decision problems (MDPs). Watkins' \(Q\)-Learning is by far the most popular RL algorithm for solving discounted-reward MDPs. The boundedness of the iterates in \(Q\)-Learning plays a critical role in its convergence analysis and in making the algorithm stable, which makes it extremely attractive for numerical solutions. Previous results show boundedness asymptotically in an almost sure sense. We present a new result that shows boundedness in an absolute sense under weaker conditions on the step size. Moreover, our proof is based on simple induction arguments.
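The boundedness property discussed in the summary can be illustrated on a toy example. The sketch below (a hypothetical illustration, not the paper's construction) runs tabular \(Q\)-Learning on a small random MDP with rewards in \([-r_{\max}, r_{\max}]\) and \(Q\) initialized at zero. Since each update is a convex combination of the old value and a target bounded by \(r_{\max} + \gamma\,r_{\max}/(1-\gamma) = r_{\max}/(1-\gamma)\), a simple induction shows every iterate stays within \(\pm r_{\max}/(1-\gamma)\):

```python
import random

def q_learning_bound_demo(n_states=3, n_actions=2, gamma=0.9,
                          r_max=1.0, steps=5000, seed=0):
    """Run tabular Q-learning on a toy random MDP and report the largest
    absolute Q-value seen, alongside the theoretical bound r_max/(1-gamma)."""
    rng = random.Random(seed)
    # Deterministic random transitions and bounded rewards (assumed toy data).
    P = {(s, a): rng.randrange(n_states)
         for s in range(n_states) for a in range(n_actions)}
    R = {(s, a): rng.uniform(-r_max, r_max)
         for s in range(n_states) for a in range(n_actions)}
    bound = r_max / (1.0 - gamma)      # theoretical bound on |Q|
    Q = {(s, a): 0.0 for s in range(n_states) for a in range(n_actions)}
    visits = {}                        # per state-action visit counts
    s, worst = 0, 0.0
    for _ in range(steps):
        a = rng.randrange(n_actions)   # purely exploratory (random) policy
        s2 = P[(s, a)]
        visits[(s, a)] = visits.get((s, a), 0) + 1
        alpha = 1.0 / visits[(s, a)]   # classical step size alpha_n = 1/n
        # Q-learning update: convex combination of old value and target.
        target = R[(s, a)] + gamma * max(Q[(s2, b)] for b in range(n_actions))
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        worst = max(worst, abs(Q[(s, a)]))
        s = s2
    return worst, bound
```

Running the demo, `worst` never exceeds `bound`, matching the induction argument: if \(|Q| \le r_{\max}/(1-\gamma)\) before an update, the same holds after it, regardless of how the step sizes are chosen in \((0,1]\).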

MSC:

93E35 Stochastic learning and adaptive control
90C15 Stochastic programming
90C39 Dynamic programming
Full Text: DOI

References:

[1] Bertsekas, D.; Tsitsiklis, J., Neuro-dynamic Programming (1996), Athena Scientific: Athena Scientific MA · Zbl 0924.68163
[2] Borkar, V.; Meyn, S., The ODE method for convergence of stochastic approximation and reinforcement learning, SIAM J. Control Optim., 38, 2, 447-469 (2000) · Zbl 0990.62071
[3] Puterman, M., Markov Decision Processes (1994), Wiley: Wiley NY
[4] Robbins, H.; Monro, S., A stochastic approximation method, Ann. Math. Statist., 22, 400-407 (1951) · Zbl 0054.05901
[5] Sutton, R.; Barto, A., Reinforcement Learning: An Introduction (1998), MIT Press: MIT Press Cambridge, MA, USA
[6] Tsitsiklis, J., Asynchronous stochastic approximation and \(Q\)-Learning, Machine Learning, 16, 185-202 (1994) · Zbl 0820.68105
[7] C. Watkins, Learning from delayed rewards, Ph.D. Thesis, King's College, Cambridge, England, 1989.