
The RL/LLM taxonomy tree: reviewing synergies between reinforcement learning and large language models. (English) Zbl 07907417

Summary: In this work, we review research studies that combine Reinforcement Learning (RL) and Large Language Models (LLMs), two areas that owe their momentum to the development of Deep Neural Networks (DNNs). We propose a novel taxonomy of three main classes based on the way the two model types interact with each other. The first class, RL4LLM, includes studies where RL is leveraged to improve the performance of LLMs on tasks related to Natural Language Processing (NLP). RL4LLM is divided into two sub-categories depending on whether RL is used to directly fine-tune an existing LLM or to improve the LLM's prompt. In the second class, LLM4RL, an LLM assists the training of an RL model that performs a task not inherently related to natural language. We further break down LLM4RL based on the component of the RL training framework that the LLM assists or replaces, namely reward shaping, goal generation, and policy function. Finally, in the third class, RL+LLM, an LLM and an RL agent are embedded in a common planning framework without either contributing to the training or fine-tuning of the other. We further branch this class to distinguish between studies with and without natural language feedback. We use this taxonomy to explore the motivations behind the synergy of LLMs and RL, to explain the reasons for its success, and to pinpoint potential shortcomings, areas where further research is needed, and alternative methodologies that serve the same goal.
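The three-level structure described above can be read as a small tree. The following minimal Python sketch encodes the classes and sub-categories named in the summary; the `TaxonomyNode` type, its field names, and the printing helper are illustrative assumptions for this review and are not part of the original paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaxonomyNode:
    """A node of the RL/LLM taxonomy tree: a class or one of its sub-categories."""
    name: str
    description: str = ""
    children: List["TaxonomyNode"] = field(default_factory=list)


# Root and the three main classes, with the sub-categories listed in the summary.
taxonomy = TaxonomyNode("RL/LLM", "Studies combining RL and LLMs", [
    TaxonomyNode("RL4LLM", "RL improves LLM performance on NLP tasks", [
        TaxonomyNode("Fine-tuning", "RL directly fine-tunes an existing LLM"),
        TaxonomyNode("Prompt optimization", "RL improves the LLM's prompt"),
    ]),
    TaxonomyNode("LLM4RL", "An LLM assists training of an RL model on a non-language task", [
        TaxonomyNode("Reward shaping"),
        TaxonomyNode("Goal generation"),
        TaxonomyNode("Policy function"),
    ]),
    TaxonomyNode("RL+LLM", "LLM and RL agent share a planning framework without co-training", [
        TaxonomyNode("With natural language feedback"),
        TaxonomyNode("Without natural language feedback"),
    ]),
])


def print_tree(node: TaxonomyNode, indent: int = 0) -> None:
    """Pretty-print the taxonomy as an indented tree."""
    label = node.name + (f" -- {node.description}" if node.description else "")
    print("  " * indent + label)
    for child in node.children:
        print_tree(child, indent + 1)


if __name__ == "__main__":
    print_tree(taxonomy)
```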

MSC:

68T05 Learning and adaptive systems in artificial intelligence
68T50 Natural language processing
68T40 Artificial intelligence for robotics
