
Integral \(Q\)-learning and explorized policy iteration for adaptive optimal control of continuous-time linear systems. (English) Zbl 1254.49019

Summary: This paper proposes an integral \(Q\)-learning method for Continuous-Time (CT) Linear Time-Invariant (LTI) systems, which solves a Linear Quadratic Regulation (LQR) problem in real time for a given system and value function, without knowledge of the system dynamics \(A\) and \(B\). Here, \(Q\)-learning refers to a family of reinforcement learning methods that find the optimal policy through interaction with an uncertain environment. In developing the algorithm, we first derive an explorized Policy Iteration (PI) method that can handle known exploration signals. The integral \(Q\)-learning algorithm for CT LTI systems is then obtained from this PI scheme and from variants of the \(Q\)-function derived via singular perturbation of the control input. The proposed \(Q\)-learning scheme evaluates the current value function and the improved control policy at the same time, and is proven to be stable and convergent to the LQ optimal solution, provided that the initial policy is stabilizing. Practical online implementation of the proposed algorithms is investigated in terms of Persistency of Excitation (PE) and exploration. Finally, simulation results are provided to compare and verify the performance.
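For orientation, the sketch below shows the classical model-based Kleinman policy iteration for the CT LQR problem, i.e. the fixed point that the paper's stabilizing-initial-policy convergence result refers to. This is not the paper's model-free integral \(Q\)-learning or explorized PI (those avoid using \(A\) and \(B\) altogether); it is a minimal Python illustration with hypothetical system matrices chosen here purely for demonstration.

```python
# Model-based Kleinman policy iteration for the CT LQR problem.
# Illustrative reference only: the paper's integral Q-learning reaches the
# same LQ optimal solution online, without using the matrices A and B.
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are

# Hypothetical example system (not taken from the paper).
A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)          # state weighting
R = np.array([[1.0]])  # input weighting

K = np.zeros((1, 2))   # initial stabilizing gain (A itself is Hurwitz here)
for _ in range(20):
    Ac = A - B @ K
    # Policy evaluation: solve (A - B K)^T P + P (A - B K) + Q + K^T R K = 0
    P = solve_continuous_lyapunov(Ac.T, -(Q + K.T @ R @ K))
    # Policy improvement: K <- R^{-1} B^T P
    K_new = np.linalg.solve(R, B.T @ P)
    if np.linalg.norm(K_new - K) < 1e-10:
        K = K_new
        break
    K = K_new

# Compare with the gain obtained directly from the algebraic Riccati equation.
P_are = solve_continuous_are(A, B, Q, R)
print("PI gain :", K)
print("ARE gain:", np.linalg.solve(R, B.T @ P_are))
```

Each iteration only requires a Lyapunov-equation solve, and the gains converge monotonically to the LQ optimal gain; the paper's contribution is to perform the equivalent evaluation and improvement steps from measured trajectory data with exploration signals instead of from \(A\) and \(B\).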

MSC:

49N10 Linear-quadratic optimal control problems
68T05 Learning and adaptive systems in artificial intelligence
49M30 Other numerical methods in calculus of variations (MSC2010)
