
Off-policy confidence interval estimation with confounded Markov decision process. (English) Zbl 07820381

Summary: This article is concerned with constructing a confidence interval for a target policy’s value offline, based on pre-collected observational data in infinite-horizon settings. Most existing works assume that no unmeasured variables confound the observed actions. This assumption, however, is likely to be violated in real applications such as healthcare and technological industries. In this article, we show that with some auxiliary variables that mediate the effect of actions on the system dynamics, the target policy’s value is identifiable in a confounded Markov decision process. Based on this result, we develop an efficient off-policy value estimator that is robust to potential model misspecification and provide rigorous uncertainty quantification. Our method is justified by theoretical results and by simulated and real datasets obtained from ridesharing companies. A Python implementation of the proposed procedure is available at https://github.com/Mamba413/cope. Supplementary materials for this article are available online.
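As a rough illustration of the kind of interval the summary describes, the sketch below shows how a Wald-type confidence interval for a target policy’s value can be formed once one has per-trajectory evaluations of an (asymptotically normal) efficient estimator. This is a minimal sketch only, not the authors’ cope implementation: the influence-function values `psi`, the function name `wald_ci`, and the synthetic data are hypothetical placeholders, and the mediator-based identification step that handles unmeasured confounding is not reproduced here.

```python
# Illustrative sketch: a Wald-type confidence interval for an
# off-policy value estimate, assuming per-trajectory values `psi`
# of an influence-function-based estimator are already available.
import numpy as np
from scipy import stats

def wald_ci(psi, alpha=0.05):
    """Point estimate and (1 - alpha) confidence interval.

    psi : 1-D array whose sample mean estimates the target policy's
          value and whose sample variance estimates the asymptotic
          variance of the estimator.
    """
    psi = np.asarray(psi, dtype=float)
    n = psi.size
    value_hat = psi.mean()                     # point estimate
    se = psi.std(ddof=1) / np.sqrt(n)          # standard error
    z = stats.norm.ppf(1 - alpha / 2)          # normal quantile
    return value_hat, (value_hat - z * se, value_hat + z * se)

# Toy usage with synthetic (hypothetical) influence-function values.
rng = np.random.default_rng(0)
psi = rng.normal(loc=1.2, scale=0.8, size=500)
value_hat, (lo, hi) = wald_ci(psi)
print(f"estimated value {value_hat:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

The design choice reflected here is the standard one for semiparametrically efficient estimators: once the estimator admits an asymptotically normal expansion, uncertainty quantification reduces to estimating the variance of the influence-function values.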

MSC:

62-XX Statistics

Software:

COPE; DualDICE; glmer
