
Provably efficient reinforcement learning in decentralized general-sum Markov games. (English) Zbl 1519.91029

Summary: This paper addresses the problem of efficiently learning an equilibrium in general-sum Markov games through decentralized multi-agent reinforcement learning. Given the fundamental difficulty of computing a Nash equilibrium (NE), we instead aim to find a coarse correlated equilibrium (CCE), a solution concept that generalizes NE by allowing possible correlations among the agents’ strategies. We propose an algorithm in which each agent independently runs optimistic V-learning (a variant of Q-learning) to efficiently explore the unknown environment, while using a stabilized online mirror descent (OMD) subroutine for policy updates. We show that the agents can find an \(\epsilon\)-approximate CCE in at most \(\widetilde{O}(H^6 SA /\epsilon^2)\) episodes, where \(S\) is the number of states, \(A\) is the size of the largest individual action space, and \(H\) is the length of an episode. This appears to be the first sample complexity result for learning in generic general-sum Markov games. Our results rely on a novel investigation of an anytime high-probability regret bound for OMD with a dynamic learning rate and weighted regret, which may be of independent interest. One key feature of our algorithm is that it is decentralized: each agent has access only to its local information and is completely oblivious to the presence of other agents. As a result, the algorithm readily scales to an arbitrary number of agents without suffering from exponential dependence on the number of agents.
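To make the described update structure concrete, the following is a minimal, self-contained Python sketch of one agent's decentralized loop: an optimistic V-learning value update with a UCB-style exploration bonus, paired with an exponential-weights (Hedge-style) policy update with a dynamic learning rate, standing in for the paper's stabilized OMD subroutine. All constants, the learning-rate and bonus schedules, and the stand-in environment `env_step` are illustrative assumptions rather than the paper's exact choices; the sketch only illustrates the two update rules, not the full algorithm or its analysis.

```python
import numpy as np

# One agent's decentralized loop: optimistic V-learning value updates plus an
# exponential-weights (OMD-style) policy update.  Hypothetical constants throughout.

S, A, H, K = 5, 3, 4, 2000     # states, local actions, horizon, episodes (assumed)
c = 1.0                        # exploration-bonus scale (assumed)

rng = np.random.default_rng(0)

V = np.full((H + 1, S), float(H))   # optimistic value estimates
V[H] = 0.0                          # terminal values are zero
N = np.zeros((H, S), dtype=int)     # visit counts per (step, state)
policy = np.ones((H, S, A)) / A     # per-(step, state) mixed strategy
L = np.zeros((H, S, A))             # cumulative importance-weighted losses


def env_step(h, s, a):
    """Stand-in for the unknown game: the other agents' actions are folded
    into a random reward and transition.  Purely illustrative."""
    r = 0.1 * rng.random() + 0.05 * float(a == s % A)
    return r, int(rng.integers(S))


for k in range(K):
    s = int(rng.integers(S))
    for h in range(H):
        a = int(rng.choice(A, p=policy[h, s]))
        r, s_next = env_step(h, s, a)

        # Optimistic V-learning update (Q-learning-style step size plus a
        # UCB-type bonus encouraging exploration).
        N[h, s] += 1
        t = N[h, s]
        alpha = (H + 1) / (H + t)
        bonus = c * np.sqrt(H ** 3 / t)
        V[h, s] = min(H, (1 - alpha) * V[h, s]
                         + alpha * (r + V[h + 1, s_next] + bonus))

        # Exponential-weights policy update with a dynamic, count-based
        # learning rate; only the played action's loss is observed, so it is
        # importance-weighted by its selection probability.
        eta = np.sqrt(np.log(A) / (A * t))
        loss = np.clip((H - r - V[h + 1, s_next]) / H, 0.0, 1.0)
        L[h, s, a] += loss / policy[h, s, a]
        w = np.exp(-eta * (L[h, s] - L[h, s].min()))
        policy[h, s] = w / w.sum()

        s = s_next
```

Because each agent maintains only its own value table and mixed strategies over its own actions, the per-agent memory and computation are independent of the number of other agents, which is the sense in which the scheme avoids exponential blow-up in the number of players.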

MSC:

91A15 Stochastic games, stochastic differential games
91A26 Rationality and learning in game theory

Software:

Libratus
