In distributional reinforcement learning (RL), not only the expected returns but also the complete return distributions of a policy are taken into account. The return distribution for a fixed policy is given as the solution of an associated distributional Bellman equation. In this note, we consider general distributional Bellman equations and study the existence and uniqueness of their solutions, as well as tail properties of return distributions. We give necessary and sufficient conditions for the existence and uniqueness of return distributions and identify cases of regular variation.
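For orientation, a standard form of such an equation (a sketch in generic notation, not taken verbatim from this article): for a fixed policy, the random return $G(s)$ collected when starting in state $s$ satisfies
$$G(s) \;\overset{d}{=}\; R(s) + \gamma\, G(S'), \qquad S' \sim p(\,\cdot \mid s),$$
where $R(s)$ is the immediate reward, $\gamma \in [0,1)$ is a discount factor, $S'$ is the random successor state, and the return $G(S')$ on the right-hand side is taken conditionally independent of $R(s)$ given $S'$. The return distributions are the laws of the $G(s)$ that solve this system of fixed-point equations in distribution.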
We link distributional Bellman equations to multivariate affine distributional equations. We show that any solution of a distributional Bellman equation can be obtained as the vector of marginal laws of a solution to a multivariate affine distributional equation. This makes the general theory of such equations applicable to the distributional reinforcement learning setting.
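To make this link concrete (again a generic sketch, with notation that is an assumption rather than the article's): a multivariate affine distributional equation is a fixed-point equation in law of the form
$$V \;\overset{d}{=}\; A V + B,$$
where $V$ is a random vector and $(A, B)$ is a random matrix-vector pair independent of the $V$ on the right-hand side. Such equations are the multivariate analogue of one-dimensional perpetuities $Y \overset{d}{=} AY + B$, for which existence, uniqueness and tail behaviour (including regular variation) are classical topics.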
Citation: Julian Gerstenberg, Ralph Neininger, Denis Spiegel, On solutions of the distributional Bellman equation, Electronic Research Archive, 31 (2023), 4459–4483. https://doi.org/10.3934/era.2023228