Research article

The optimal probability of the risk for finite horizon partially observable Markov decision processes

  • Received: 31 May 2023 Revised: 06 September 2023 Accepted: 21 September 2023 Published: 18 October 2023
  • MSC : 60J27, 90C40

  • This paper investigates the optimality of the risk probability criterion for finite horizon partially observable discrete-time Markov decision processes (POMDPs). The risk probability under consideration is the probability that the total rewards do not exceed a preset goal value, which differs from the classical expected-reward optimization problem. Based on the Bayes operator and the filter equations, the risk probability optimization problem is equivalently reformulated as a filtered Markov decision process. By developing a value iteration technique, the optimality equation satisfied by the value function is established and the existence of a risk probability optimal policy is proven. Finally, an example illustrates the effectiveness of the value iteration algorithm for computing the value function and an optimal policy.

    Citation: Xian Wen, Haifeng Huo, Jinhua Cui. The optimal probability of the risk for finite horizon partially observable Markov decision processes[J]. AIMS Mathematics, 2023, 8(12): 28435-28449. doi: 10.3934/math.20231455




    Analyzing the risk performance of a stochastic dynamic system is an important optimal control problem, with both theoretical and applied relevance in areas such as finance and insurance [1], communication networks [2] and queueing systems [3]. Since the conventional expectation criterion cannot effectively reflect the risk performance of a system, risk probability criteria were first proposed by Sobel [4] and implemented in Markov decision processes (MDPs). Afterwards, many scholars focused on risk probability optimization problems in MDPs. According to the characteristics of the sojourn time of the system state, the existing studies can be roughly divided into four categories: (i) discrete-time Markov decision processes (DTMDPs) [5,6,7,8]; (ii) semi-Markov decision processes (SMDPs) [9,10,11]; (iii) continuous-time Markov decision processes (CTMDPs) [12,13,14]; and (iv) piecewise deterministic Markov decision processes (PDMDPs) [15]. A common feature of this existing literature is that the system state is completely observable. However, in practical applications such as machine maintenance and finance, the traditional MDP models cannot effectively describe these problems because the information of the decision environment cannot be completely observed or perceived. Therefore, it is necessary to establish partially observable Markov decision processes (POMDPs) to optimize the risk probability of the control system.

    Compared with completely observable Markov decision processes (COMDPs), the POMDP model is a more general stochastic control model with important theoretical significance and practical value, and it is widely used in fields such as industry, computational science, finance and artificial intelligence. Therefore, many scholars have focused on expected-reward optimality problems for POMDPs. More specifically, Drake [16] established the POMDP model, which attracted the attention of many experts and scholars. Regarding the expected optimality problem, Hinderer [17] discussed the finite state case, while Rhenius [18] and Hernández-Lerma [19] discussed more general state spaces. Smallwood and Sondik [20] developed a dynamic programming algorithm to compute optimal strategies and the value function. Sawaki and Ichikawa [21] proposed a successive approximation method to calculate an optimal strategy and the value function. White and Scherer [22] solved the infinite-horizon discounted optimization problem by modifying the reward function and employing an iterative approximation algorithm. Bäuerle and Rieder [1] established the optimality equation by equivalently converting POMDPs into a filtered MDP model and proved the existence of an optimal strategy. Feinberg et al. [23] established sufficient conditions ensuring the existence of optimal strategies and an optimality equation for more general state and action spaces. In addition, many scholars have focused on computational algorithms for POMDPs [24]. However, these criteria mainly address the expected value of the total rewards, which cannot effectively describe the risk faced by the control system. Therefore, it is necessary to introduce risk probability criteria that capture the risk performance of the system. An overview of the existing literature indicates that the risk probability criterion for POMDPs has not been studied thus far; this paper is the first attempt to solve the risk probability optimization problem for POMDPs.

    The optimization problem is to minimize the risk probability criterion, that is, the probability that the system's total rewards do not exceed the profit goal. Because the reward level is regarded as the second component of an extended state, it is necessary to redescribe the evolution of the system and to redefine the history-dependent, Markov and stationary policies. Thus, for any given redefined policy, a new probability space must be constructed using the well-known Ionescu Tulcea theorem (see, e.g., Proposition 7.45 in [25]), based on any initial system state and reward goal. Second, the conditional probability distribution of the unobservable state is constructed by redefining the Bayes operator (including the reward levels) and establishing the filter equations. Then, based on this conditional probability distribution, a new filtered risk probability MDP model is established by expanding the state and action spaces and modifying the transition kernel and the reward function. Furthermore, we prove that the new filtered MDP model reveals the relationship between the partially and completely observable optimization problems. Building on the risk probability optimality theory for COMDPs and using the value iteration technique, the optimality equation is established and the existence of optimal policies is proven. Finally, a machine maintenance example is given to illustrate our main results, including the use of the value iteration algorithm to calculate the value function and an optimal policy.

    The rest of the manuscript is organized as follows. Section 2 formulates the risk probability minimization problem for POMDPs. Section 3 presents the main results, including the existence of a solution to the optimality equation and of optimal policies. An example illustrating the value iteration algorithm for calculating the value function and an optimal policy is given in Section 4.

    The model of POMDPs consists of the following elements:

    $$\{E_X\times E_Y,\ \{A(x)\subset A,\ x\in E_X\},\ Q(\cdot\mid x,y,a),\ r(x,y,a),\ Q_0\} \tag{2.1}$$

    which have the following meanings:

    (a) $E_X\times E_Y$ represents a Borel space with a Borel $\sigma$-algebra $\mathcal{B}(E_X\times E_Y)$. The element $(x,y)\in E_X\times E_Y$ is the system state, where $x$ denotes the observable portion of the state and $y$ denotes the unobservable portion.

    (b) $A$ represents the action space, a Borel space with a Borel $\sigma$-algebra $\mathcal{A}$. $A(x)\subset A$ represents the set of admissible actions in state $x\in E_X$, which is assumed to be finite. Moreover, the set of all admissible state-action pairs is denoted by $K:=\{(x,a)\mid x\in E_X,\ a\in A(x)\}$.

    (c) $Q(\cdot\mid x,y,a)$ denotes the transition probability from $E_X\times E_Y\times A$ to $E_X\times E_Y$, which describes the transition mechanism of the controlled state process. For simplicity, we write $Q_X$ for the marginal transition probability $Q_X(B\mid x,y,a):=Q(B\times E_Y\mid x,y,a)$.

    (d) $r(x,y,a)$ denotes a nonnegative real-valued measurable reward function from $K\times E_Y$ to $\mathbb{R}_+:=[0,+\infty)$.

    (e) $Q_0$ denotes the initial probability distribution of the unobservable state.

    The evolution of the risk probability POMDP is characterized as follows. At time $s_0=0$, based on the observed state $x_0$ and the reward goal (reward level) $\tilde\lambda_0:=\lambda_0$, the decision maker picks an action $a_0$ from the set of admissible actions $A(x_0)$. The observed state then remains unchanged until time $s_1=1$, at which point the system state transfers to a state $x_1\in B_1\subset E_X$ with probability $\int_{E_Y}\int_{B_1\times E_Y}Q(dx_1,dy_1\mid x_0,y_0,a_0)\,Q_0(dy_0)$. Meanwhile, the unobservable state $y_0$ also transfers to the next state $y_1$ with a probability constructed by the Bayes operator in (3.3) below. During this period, the control system generates the reward $r(x_0,y_0,a_0)$, and the corresponding reward goal becomes $\tilde\lambda_1=\lambda_0-r(x_0,y_0,a_0)$. At the new decision time $s_1=1$, based on the observable information of the system $h_1=(x_0,\lambda_0,a_0,x_1,\tilde\lambda_1)$, the decision maker picks a new action $a_1\in A(x_1)$. Afterwards, the system evolves similarly and produces the observable history up to time $s_k=k$:

    $$h_k:=(x_0,\lambda_0,a_0,x_1,\tilde\lambda_1,a_1,\ldots,x_k,\tilde\lambda_k), \tag{2.2}$$

    where $x_k$ and $y_k$ denote the observable and unobservable parts of the system state at the $k$-th decision epoch, respectively, $a_k$ represents the action chosen by the decision maker at time $s_k=k$, and $\tilde\lambda_k$ denotes the remaining reward goal, meaning that the decision maker seeks to minimize the probability that the total rewards do not exceed this goal; it satisfies the following relation:

    $$\tilde\lambda_{k+1}:=\tilde\lambda_k-r(x_k,y_k,a_k), \tag{2.3}$$

    for $k=0,1,\ldots$.
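    As a purely numerical illustration of recursion (2.3) (these numbers are not taken from the model in Section 4), suppose the initial goal is $\lambda_0=5$ and the first two rewards are $r(x_0,y_0,a_0)=2$ and $r(x_1,y_1,a_1)=1$. Then

    $$\tilde\lambda_1=5-2=3,\qquad \tilde\lambda_2=3-1=2,$$

    so requiring the total rewards not to exceed $\lambda_0$ is the same as requiring the rewards still to come not to exceed the current level $\tilde\lambda_k$; this is exactly why the reward level is carried along as a component of the extended state.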

    The sets of all observable histories $h_k$ are given by $H_0:=E_X\times\mathbb{R}$ and $H_k:=H_{k-1}\times A\times E_X\times\mathbb{R}$ for $k\ge 1$. Based on these observable histories, the following policies are introduced.

    Definition 2.1. (a) A sequence $\pi=\{\pi_k,\ k\ge 0\}$ of stochastic kernels $\pi_k: H_k\to A$ is said to be a randomized history-dependent policy if

    $$\pi_k(A(x_k)\mid h_k)=1 \quad\text{for all } h_k\in H_k,\ k=0,1,2,\ldots.$$

    (b) A randomized history-dependent policy is said to be deterministic if there exists a sequence $\{g_k\}$ of measurable functions $g_k$ from $H_k$ to $A$ with $g_k(h_k)\in A(x_k)$ such that $\pi_k(\cdot\mid h_k)$ is the Dirac measure at $g_k(h_k)$ for all $h_k\in H_k$, $k\ge 0$. The sets of all randomized and deterministic history-dependent policies are denoted by $\Pi$ and $\Pi^{DH}$, respectively.

    The risk probability POMDP must keep track of both the system state and the reward goal, which differs from the conventional expectation MDP that tracks only the system state; thus, the results of the classical expectation MDP cannot be applied directly to the proposed model. First, we reconstruct the measurable space $(\Omega,\mathcal{F})$ as follows: the sample space is $\Omega:=\{(x_0,y_0,\lambda_0,a_0,x_1,y_1,\lambda_1,a_1,\ldots,x_k,y_k,\lambda_k,a_k,\ldots)\mid x_0\in E_X,\ y_0\in E_Y,\ \lambda_0\in\mathbb{R},\ a_0\in A,\ x_l\in E_X,\ y_l\in E_Y,\ \lambda_l\in\mathbb{R},\ a_l\in A \text{ with } 1\le l\le k,\ k\ge 1\}$, endowed with the Borel $\sigma$-algebra $\mathcal{F}$. For any $\omega:=(x_0,y_0,\lambda_0,a_0,x_1,y_1,\lambda_1,a_1,\ldots,x_k,y_k,\lambda_k,a_k,\ldots)\in\Omega$, some random variables are defined as follows:

    $$X_k(\omega):=x_k,\quad Y_k(\omega):=y_k,\quad \Lambda_k(\omega):=\lambda_k,\quad A_k(\omega):=a_k,\quad k\ge 0.$$

    The ω will be omitted for convenience.

    For any policy $\pi\in\Pi$, $(x,\lambda)\in E_X\times\mathbb{R}$ and initial distribution $Q_0$ of $Y_0$ on $E_Y$, the Ionescu Tulcea theorem (e.g., Proposition 7.45 in [25]) yields a unique probability measure $P^\pi_{(x,\lambda)}(\cdot)=\int_{E_Y}P^\pi_{(x,\lambda,y)}(\cdot)\,Q_0(dy)$ on $(\Omega,\mathcal{F})$, constructed as follows: for all $B\in\mathcal{B}(E_X)$, $C\in\mathcal{B}(E_Y)$, $G\in\mathcal{B}(\mathbb{R})$, $D\in\mathcal{B}(A)$, $h_k\in H_k$, $k=0,1,\ldots$,

    $$P^\pi_{(x,\lambda,y)}(X_0=x,\Lambda_0=\lambda)=1, \tag{2.4}$$
    $$P^\pi_{(x,\lambda,y)}(A_k\in D\mid h_k)=\int_D\pi_k(da_k\mid h_k), \tag{2.5}$$
    $$P^\pi_{(x,\lambda,y)}(X_{k+1}\in B,\,Y_{k+1}\in C,\,\Lambda_{k+1}\in G\mid h_k,y_k,a_k)=\int_{B\times C}\int_G Q(dx_{k+1},dy_{k+1}\mid x_k,y_k,a_k)\,I_{\{\lambda_k-r(x_k,y_k,a_k)\}}(d\lambda_{k+1}). \tag{2.6}$$

    The expectation operator corresponding to the probability measure $P^\pi_{(x,\lambda)}$ is denoted by $E^\pi_{(x,\lambda)}$.

    For any $(x,\lambda)\in E_X\times\mathbb{R}$ and $\pi\in\Pi$, the risk probability criterion of the POMDP is defined as follows:

    $$F^\pi_N(x,\lambda)=\int_{E_Y}P^\pi_{(x,y,\lambda)}\Big(\sum_{n=0}^{N}r(X_n,Y_n,A_n)\le\lambda\Big)\,Q_0(dy). \tag{2.7}$$

    Then, $F_N(x,\lambda):=\inf_{\pi\in\Pi}F^\pi_N(x,\lambda)$ is known as the risk probability value function.
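    As a purely illustrative aid (not part of the paper's development), the criterion (2.7) can be approximated for a fixed policy by straightforward simulation when all spaces are finite. The sketch below uses hypothetical placeholder data structures (`Q`, `Q0`, `reward` and `policy` would have to be supplied by the reader); it simulates trajectories of length $N+1$ and counts how often the accumulated reward does not exceed the goal $\lambda$:

```python
import random

def estimate_risk_probability(x0, lam, N, Q, Q0, reward, policy, num_sims=10000):
    """Monte Carlo estimate of F_N^pi(x0, lam) = P(sum of rewards <= lam).

    Hypothetical finite-model interface (placeholders, not from the paper):
      Q[(x, y, a)]       -> dict {(x_next, y_next): prob}, joint transition law
      Q0                 -> dict {y: prob}, initial law of the unobservable state
      reward[(x, y, a)]  -> float, reward r(x, y, a)
      policy(x, lam_rem) -> action, a fixed decision rule
    """
    hits = 0
    for _ in range(num_sims):
        # draw the unobservable initial state from Q0
        y = random.choices(list(Q0), weights=list(Q0.values()))[0]
        x, lam_rem, total = x0, lam, 0.0
        for _ in range(N + 1):
            a = policy(x, lam_rem)
            total += reward[(x, y, a)]
            lam_rem -= reward[(x, y, a)]   # reward-goal recursion (2.3)
            nxt = Q[(x, y, a)]
            x, y = random.choices(list(nxt), weights=list(nxt.values()))[0]
        if total <= lam:                   # the "risk" event in (2.7)
            hits += 1
    return hits / num_sims
```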

    Definition 2.2. A policy $\pi^*\in\Pi$ is called risk probability optimal if

    $$F^{\pi^*}_N(x,\lambda)=F_N(x,\lambda) \quad\text{for all } (x,\lambda)\in E_X\times\mathbb{R}. \tag{2.8}$$

    The general objective of this manuscript is to optimize the risk probability criterion, establish the optimality equation satisfied by the value function, and give conditions for the existence of an optimal policy.

    Notation: Let $\mathcal{P}(E_Y)$ be the space of all probability measures on $E_Y$.

    Assumption 3.1. The transition probability $Q$ has a density function $q$ satisfying $Q(d(x',y')\mid x,y,a)=q(x',y'\mid x,y,a)\,\rho(dx')\,\nu(dy')$ for some $\sigma$-finite measures $\rho$ and $\nu$.

    The Bayes operator $\Phi: E_X\times\mathbb{R}\times A\times E_X\times\mathbb{R}\times\mathcal{P}(E_Y)\to\mathcal{P}(E_Y)$ is first defined as follows:

    $$\Phi(x,\lambda,a,x',\lambda',\mu)(C):=\frac{\int_C\int_{E_Y} q(x',y'\mid x,y,a)\, I_{\{\lambda-r(x,y,a)\}}(\lambda')\,\mu(dy)\,\nu(dy')}{\int_{E_Y}\int_{E_Y} q(x',y'\mid x,y,a)\,\mu(dy)\,\nu(dy')}, \tag{3.1}$$

    where $C\in\mathcal{B}(E_Y)$ and $\mu$ denotes the distribution of the unobservable state. Furthermore, by iteration, for any $h_k\in H_k$, $k=0,1,\ldots$, the conditional probability distribution $\mu_k$ of the unobservable variable $Y_k$ is given by the following:

    $$\mu_0(C\mid h_0):=Q_0(C), \tag{3.2}$$
    $$\mu_{k+1}(C\mid h_k,a,x',\lambda'):=\Phi(x_k,\lambda_k,a,x',\lambda',\mu_k(\cdot\mid h_k))(C), \tag{3.3}$$

    which are called filter equations.
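    When $E_Y$ is finite, the Bayes operator (3.1) and the filter equations (3.2) and (3.3) reduce to reweighting and renormalizing the current belief vector. The following is a minimal sketch under that finiteness assumption; the dictionaries `q` and `reward` and the state/action labels are hypothetical placeholders, the Dirac-type indicator in (3.1) is implemented as an exact-equality check on the observed goal decrement, and, as a simplification relative to the displayed denominator in (3.1), the update is normalized by the total numerator mass so that it returns a probability vector:

```python
def bayes_update(x, lam, a, x_next, lam_next, mu, q, reward, EY, tol=1e-12):
    """Finite-space sketch of the Bayes operator Phi in (3.1) / filter step (3.3).

    mu                           : dict {y: prob}, current belief over the unobservable state
    q[(x_next, y_next, x, y, a)] : transition density q(x', y' | x, y, a) (placeholder layout)
    reward[(x, y, a)]            : reward r(x, y, a)
    Returns the updated belief over y' in EY, normalized by the total numerator mass.
    """
    new_mu = {}
    for y2 in EY:
        # numerator of (3.1): sum over y, keeping only those y consistent with the
        # observed goal decrement lam_next = lam - r(x, y, a)
        new_mu[y2] = sum(
            q[(x_next, y2, x, y, a)] * mu[y]
            for y in EY
            if abs(lam_next - (lam - reward[(x, y, a)])) < tol
        )
    total = sum(new_mu.values())
    if total < tol:
        return dict(mu)   # degenerate case: keep the previous belief
    return {y2: w / total for y2, w in new_mu.items()}


# Filter equations (3.2)-(3.3): start from mu_0 = Q0, then apply the operator repeatedly.
def filter_trajectory(Q0, transitions, q, reward, EY):
    """transitions: iterable of tuples (x_k, lam_k, a_k, x_next, lam_next)."""
    mu = dict(Q0)
    beliefs = [mu]
    for (x, lam, a, x_next, lam_next) in transitions:
        mu = bayes_update(x, lam, a, x_next, lam_next, mu, q, reward, EY)
        beliefs.append(mu)
    return beliefs
```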

    Lemma 3.1. Under Assumption 3.1, for any $\pi\in\Pi$ and $B\in\mathcal{B}(E_Y)$, the following statement holds:

    $$P^\pi_{(x,\lambda)}(Y_n\in B\mid X_0,\Lambda_0,A_0,\ldots,X_n,\Lambda_n)=\mu_n(B\mid X_0,\Lambda_0,A_0,\ldots,X_n,\Lambda_n). \tag{3.4}$$

    Proof. For each $\pi\in\Pi$ and $x\in E_X$, $\lambda\in\mathbb{R}$, the following result is proven by induction:

    $$E^\pi_{(x,\lambda)}[V(X_0,\Lambda_0,A_0,\ldots,X_n,\Lambda_n,Y_n)]=E^\pi_{(x,\lambda)}[\hat V(X_0,\Lambda_0,A_0,\ldots,X_n,\Lambda_n)], \tag{3.5}$$

    for every bounded and measurable function $V: H_n\times E_Y\to\mathbb{R}$, where $\hat V(h_n)=\int V(h_n,y_n)\,\mu_n(dy_n\mid h_n)$. Since $\hat V(h_0)=\int V(h_0,y)\,Q_0(dy)$, fact (3.5) is true when $n=0$. For any given $h_{n-1}\in H_{n-1}$, $n\ge 1$, suppose that (3.5) holds for $k=n-1$. Using (2.6), the definition of the Bayes operator and Fubini's theorem, we obtain the following:

    $$
    \begin{aligned}
    &E^\pi_{(x,\lambda)}[\hat V(h_{n-1},A_{n-1},X_n,\Lambda_n)]\\
    &=\int_{E_Y}\mu_{n-1}(dy_{n-1}\mid h_{n-1})\int_{E_X}\int_A Q_X(dx_n\mid x_{n-1},y_{n-1},a_{n-1})\int_{\mathbb{R}} I_{\{\lambda_{n-1}-r(x_{n-1},y_{n-1},a_{n-1})\}}(d\lambda_n)\,\hat V(h_{n-1},a_{n-1},x_n,\lambda_n)\,\pi_{n-1}(da_{n-1}\mid h_{n-1})\\
    &=\int_{E_Y}\mu_{n-1}(dy_{n-1}\mid h_{n-1})\int_{E_X}\int_A Q_X(dx_n\mid x_{n-1},y_{n-1},a_{n-1})\int_{\mathbb{R}} I_{\{\lambda_{n-1}-r(x_{n-1},y_{n-1},a_{n-1})\}}(d\lambda_n)\\
    &\qquad\times\int_{E_Y}V(h_{n-1},a_{n-1},x_n,\lambda_n,y_n)\,\mu_n(dy_n\mid h_{n-1},a_{n-1},x_n,\lambda_n)\,\pi_{n-1}(da_{n-1}\mid h_{n-1})\\
    &=\int_{E_Y}\mu_{n-1}(dy_{n-1}\mid h_{n-1})\int_{E_X}\int_{E_Y}\int_A \rho(dx_n)\,\nu(dy)\,q(x_n,y\mid x_{n-1},y_{n-1},a_{n-1})\int_{\mathbb{R}} I_{\{\lambda_{n-1}-r(x_{n-1},y_{n-1},a_{n-1})\}}(d\lambda_n)\\
    &\qquad\times\int_{E_Y}V(h_{n-1},a_{n-1},x_n,\lambda_n,y_n)\,\Phi(x_{n-1},\lambda_{n-1},a_{n-1},x_n,\lambda_n,\mu_{n-1})(dy_n)\,\pi_{n-1}(da_{n-1}\mid h_{n-1})\\
    &=\int_{E_Y}\mu_{n-1}(dy_{n-1}\mid h_{n-1})\int_{E_X}\int_{E_Y}\int_A q(x_n,y_n\mid x_{n-1},y_{n-1},a_{n-1})\,\rho(dx_n)\,\nu(dy_n)\\
    &\qquad\times V(h_{n-1},a_{n-1},x_n,\lambda_{n-1}-r(x_{n-1},y_{n-1},a_{n-1}),y_n)\,\pi_{n-1}(da_{n-1}\mid h_{n-1}).
    \end{aligned}\tag{3.6}
    $$

    On the other hand, by induction, we have the following:

    $$
    \begin{aligned}
    &E^\pi_{(x,\lambda)}[V(h_{n-1},A_{n-1},X_n,\Lambda_n,Y_n)]\\
    &=\int_{E_Y}\mu_{n-1}(dy_{n-1}\mid h_{n-1})\int_{E_X}\int_{E_Y}\int_A Q(d(x_n,y_n)\mid x_{n-1},y_{n-1},a_{n-1})\\
    &\qquad\times\int_{\mathbb{R}} I_{\{\lambda_{n-1}-r(x_{n-1},y_{n-1},a_{n-1})\}}(d\lambda_n)\,V(h_{n-1},a_{n-1},x_n,\lambda_n,y_n)\,\pi_{n-1}(da_{n-1}\mid h_{n-1})\\
    &=\int_{E_Y}\mu_{n-1}(dy_{n-1}\mid h_{n-1})\int_{E_X}\int_{E_Y}\int_A q(x_n,y_n\mid x_{n-1},y_{n-1},a_{n-1})\,\rho(dx_n)\,\nu(dy_n)\\
    &\qquad\times\int_{\mathbb{R}} I_{\{\lambda_{n-1}-r(x_{n-1},y_{n-1},a_{n-1})\}}(d\lambda_n)\,V(h_{n-1},a_{n-1},x_n,\lambda_n,y_n)\,\pi_{n-1}(da_{n-1}\mid h_{n-1})\\
    &=\int_{E_Y}\mu_{n-1}(dy_{n-1}\mid h_{n-1})\int_{E_X}\int_{E_Y}\int_A q(x_n,y_n\mid x_{n-1},y_{n-1},a_{n-1})\,\rho(dx_n)\,\nu(dy_n)\\
    &\qquad\times V(h_{n-1},a_{n-1},x_n,\lambda_{n-1}-r(x_{n-1},y_{n-1},a_{n-1}),y_n)\,\pi_{n-1}(da_{n-1}\mid h_{n-1}),
    \end{aligned}\tag{3.7}
    $$

    which, together with Eq (3.6), implies that (3.5) is satisfied. In particular, taking $V(X_0,\Lambda_0,A_0,\ldots,X_n,\Lambda_n,Y_n)=I_{B\times C}(Y_n,(X_0,\Lambda_0,A_0,\ldots,X_n,\Lambda_n))$, we obtain the following:

    $$P^\pi_{(x,\lambda)}\big(Y_n\in B,(X_0,\Lambda_0,A_0,\ldots,X_n,\Lambda_n)\in C\big)=E^\pi_{(x,\lambda)}\big[\mu_n(B\mid X_0,\Lambda_0,A_0,\ldots,X_n,\Lambda_n)\,I_C(X_0,\Lambda_0,A_0,\ldots,X_n,\Lambda_n)\big],$$

    which implies that the Lemma holds.

    The partially observable risk probability MDP can be transformed into a filtered risk probability MDP by enlarging the state space and modifying the transition kernel and the reward function.

    Definition 3.1. The filtered model of the POMDP consists of the elements $\{E,A,\hat Q,\hat r\}$, which have the following meanings:

    $E:=E_X\times\mathcal{P}(E_Y)$ denotes the state space, with elements written $(x,\mu)\in E$, where $x$ denotes the observable state and $\mu$ denotes the conditional probability distribution of the unobservable state.

    $A$ denotes the action space. $A(x,\mu):=A(x)\subset A$ denotes the class of admissible actions in the state $(x,\mu)\in E$.

    $\hat r$ denotes a nonnegative real-valued measurable reward function, defined for $(x,\mu,a)$ with $(x,a)\in K$ and $\mu\in\mathcal{P}(E_Y)$, satisfying the following:

    $$\hat r(x,\mu,a)=\int r(x,y,a)\,\mu(dy).$$

    $\hat Q$ denotes the transition law from $E\times\mathbb{R}\times A$ to $E\times\mathbb{R}$, which is expressed as follows:

    $$\hat Q(B\times C\times D\mid x,\lambda,\mu,a):=\int_B\int_C I_D\big(\Phi(x,\lambda,a,\hat x,\hat\lambda,\mu)\big)\, I_{\{\lambda-\hat r(x,\mu,a)\}}(d\hat\lambda)\, Q_X(d\hat x\mid x,\mu,a),$$

    where $Q_X(B\mid x,\mu,a):=\int_{E_Y} Q_X(B\mid x,y,a)\,\mu(dy)$ for all $(x,\mu)\in E$, $\lambda\in\mathbb{R}$, $a\in A(x)$, $B\in\mathcal{B}(E_X)$, $D\in\mathcal{B}(\mathcal{P}(E_Y))$, $C\in\mathcal{B}(\mathbb{R})$.
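    For finite $E_Y$, the filtered reward $\hat r$ is simply the belief-weighted average of $r$; a minimal sketch (with `reward` and `mu` as hypothetical dictionaries in the same spirit as the earlier sketches):

```python
def filtered_reward(x, mu, a, reward):
    """hat_r(x, mu, a) = sum over y of r(x, y, a) * mu(y), for a finite unobservable space.

    mu is a dict {y: prob}; reward[(x, y, a)] is the reward r(x, y, a).
    """
    return sum(reward[(x, y, a)] * p for y, p in mu.items())
```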

    To formulate the optimal control problem rigorously, some notation and the definitions of policies for the filtered MDP are introduced. $\Phi$ stands for the class of stochastic kernels $\varphi$ on $A$ given $E\times\mathbb{R}$ with the property $\varphi(A(x)\mid x,\lambda,\mu)=1$ for all $(x,\lambda,\mu)\in E\times\mathbb{R}$. $F$ stands for the class of all measurable mappings $f$ from $E\times\mathbb{R}$ to $A$ with $f(x,\lambda,\mu)\in A(x)$ for all $(x,\lambda,\mu)\in E\times\mathbb{R}$.

    Definition 3.2. A randomized Markov policy is a sequence $\pi^M=\{\hat\varphi_k,\ k\ge 0\}$ of stochastic kernels $\hat\varphi_k\in\Phi$ satisfying $\hat\varphi_k(A(x_k)\mid x_k,\lambda_k,\mu_k)=1$ for each $\mu_k\in\mathcal{P}(E_Y)$, $k\ge 0$. Such a policy is written $\pi^M=\{\hat\varphi_k\}$.

    A randomized Markov policy $\pi^M=\{\hat\varphi_k\}$ is called deterministic Markov if there exists a sequence of functions $\{f_k,\ k\ge 0\}\subset F$ such that $\hat\varphi_k(\cdot\mid x_k,\lambda_k,\mu_k)$ is concentrated at $f_k(x_k,\lambda_k,\mu_k)$ for each $k\ge 0$.

    The classes of all randomized and deterministic Markov policies are denoted by $\Pi^{RM}$ and $\Pi^{DM}$, respectively. By the above definition, such Markov policies depend on the historical information $h_k$, $k\ge 0$, through the filter $\mu_k(\cdot\mid h_k)$. Then, for any $\pi^M=\{\hat\varphi_0,\hat\varphi_1,\ldots\}\in\Pi^{RM}$, we can find a policy $\pi=\{\pi_0,\pi_1,\ldots\}\in\Pi$ that satisfies the following:

    $$\pi_0(da_0\mid x_0,y_0,\lambda_0):=\hat\varphi_0(da_0\mid x_0,\lambda_0,\mu_0), \tag{3.8}$$
    $$\pi_k(da_k\mid h_k):=\hat\varphi_k(da_k\mid x_k,\lambda_k,\mu_k(\cdot\mid h_k)), \tag{3.9}$$

    for $k\ge 0$ and $h_k\in H_k$. Thus, $\Pi^{DM}\subset\Pi^{RM}\subset\Pi$.

    Based on the transition probability $\hat Q$ and the initial distribution $\mu_0$, for any $(x,\lambda,\mu)\in E\times\mathbb{R}$ and $\pi\in\Pi$, according to the Ionescu Tulcea theorem (e.g., Proposition 7.45 in [25]), the probability measure $\hat P^\pi_{(x,\lambda,\mu)}$ can be constructed on $(\Omega,\mathcal{F})$ as follows:

    $$\hat P^\pi_{(x,\lambda,\mu)}(X_0=x,\Lambda_0=\lambda)=1, \tag{3.10}$$
    $$\hat P^\pi_{(x,\lambda,\mu)}(A_k\in G\mid h_k)=\pi_k(G\mid h_k), \tag{3.11}$$
    $$\hat P^\pi_{(x,\lambda,\mu)}(X_{k+1}\in B,\,\Lambda_{k+1}\in C,\,\mu_{k+1}\in D\mid h_k,\mu_k)=\int_{B\times C\times D}\int_A \hat Q(dx_{k+1},d\lambda_{k+1},d\mu_{k+1}\mid x_k,\lambda_k,\mu_k,a_k)\,\pi_k(da_k\mid h_k), \tag{3.12}$$

    for all $h_k\in H_k$, $a\in A(x_k)$, $B\in\mathcal{B}(E_X)$, $C\in\mathcal{B}(\mathbb{R})$, $D\in\mathcal{B}(\mathcal{P}(E_Y))$, $G\in\mathcal{B}(A)$. The expectation operator corresponding to the probability measure $\hat P^\pi_{(x,\lambda,\mu)}$ is denoted by $\hat E^\pi_{(x,\lambda,\mu)}$.

    For any $(x,\lambda,\mu)\in E\times\mathbb{R}$ and $\pi\in\Pi$, the value function of the filtered MDP is given by the following:

    $$U^\pi_N(x,\lambda,\mu):=\hat P^\pi_{(x,\lambda,\mu)}\Big(\sum_{n=0}^{N}\hat r(X_n,\mu_n,A_n)\le\lambda\Big), \tag{3.13}$$
    $$U_N(x,\lambda,\mu):=\inf_{\pi\in\Pi}U^\pi_N(x,\lambda,\mu). \tag{3.14}$$

    Notation: For any policy $\pi\in\Pi$ and $(x,\lambda,\mu)\in E\times\mathbb{R}$, the risk probability of the total rewards $U^\pi_n$ is defined as follows:

    $$U^\pi_n(x,\lambda,\mu):=\hat P^\pi_{(x,\lambda,\mu)}\Big(\sum_{k=0}^{n}\hat r(X_k,\mu_k,A_k)\le\lambda\Big),$$

    with $n=0,1,\ldots,N$.

    Moreover, the minimal risk probability of the filtered MDP model is defined by the following:

    $$U_n(x,\lambda,\mu):=\inf_{\pi\in\Pi}U^\pi_n(x,\lambda,\mu).$$

    Let $\mathcal{U}$ be the class of mappings $U: E\times\mathbb{R}\to[0,1]$. For any $(x,\lambda,\mu)\in E\times\mathbb{R}$, $U\in\mathcal{U}$, $\varphi\in\Phi$ and $a\in A(x)$, the operators $T_aU$, $T_\varphi U$ and $TU$ are defined as follows:

    $$T_aU(x,\lambda,\mu):=\int_{E_X}U\big(\hat x,\lambda-\hat r(x,\mu,a),\Phi(x,\lambda,a,\hat x,\lambda-\hat r(x,\mu,a),\mu)\big)\,Q_X(d\hat x\mid x,\mu,a),\qquad T_\varphi U(x,\lambda,\mu):=\int_A T_aU(x,\lambda,\mu)\,\varphi(da\mid x,\lambda,\mu), \tag{3.15}$$
    $$TU(x,\lambda,\mu):=\min_{a\in A(x)}T_aU(x,\lambda,\mu). \tag{3.16}$$

    Based on the conditional distribution of the unobservable state given by the filter equations, the following equivalence between the partially observable model and the filtered model is established.

    Lemma 3.2. Under Assumption 3.1, for each $\pi\in\Pi$, $x\in E_X$ and $\lambda\in\mathbb{R}$, we have $F^\pi_N(x,\lambda)=U^\pi_N(x,\lambda,Q_0)$ and $F_N(x,\lambda)=U_N(x,\lambda,Q_0)$.

    Proof. For each $\pi\in\Pi$ and $x\in E_X$, the following result is first proven by induction:

    $$F^\pi_n(x,\Lambda)=U^\pi_n(x,\hat\Lambda,\mu), \tag{3.17}$$

    with $n=-1,0,1,2,\ldots$, for the reward goal function $\Lambda: K\times E_Y\to\mathbb{R}_+$ and $\hat\Lambda=\int\Lambda(x,y,a)\,\mu(dy)$.

    Since $F^\pi_{-1}=U^\pi_{-1}=I_{[0,+\infty)}(\lambda)$ for any $\pi\in\Pi$ and $x\in E_X$, $\lambda\in\mathbb{R}$, Eq (3.17) is valid when $n=-1$. Suppose that (3.17) holds for $k=n$; then, by (2.6),

    $$
    \begin{aligned}
    U^\pi_{n+1}(x,\lambda,\mu_0)&=\hat P^\pi_{(x,\lambda,\mu_0)}\Big(\sum_{k=0}^{n+1}\hat r(X_k,\mu_k,A_k)\le\lambda\Big)\\
    &=\hat E^\pi_{(x,\lambda,\mu_0)}\Big[\hat E^\pi_{(x,\lambda,\mu_0)}\big[I_{\{\sum_{k=0}^{n+1}\hat r(X_k,\mu_k,A_k)\le\lambda\}}\,\big|\,X_0,\Lambda_0,\mu_0,A_0,X_1,\Lambda_1,\mu_1\big]\Big]\\
    &=\int_{E_X}\int_{\mathbb{R}}\int_{\mathcal{P}(E_Y)}\int_A \hat E^\pi_{(x,\lambda,\mu_0)}\big[I_{\{\sum_{k=0}^{n+1}\hat r(X_k,\mu_k,A_k)\le\lambda\}}\,\big|\,X_0=x,\Lambda_0=\lambda,\mu_0=Q_0,A_0=a_0,X_1=x_1,\\
    &\qquad \Lambda_1=\lambda-\hat r(x,Q_0,a_0),\mu_1=\Phi(x,\lambda,a_0,x_1,\lambda_1,Q_0)\big]\,\hat Q(dx_1,d\lambda_1,d\mu_1\mid x,\lambda,Q_0,a_0)\,\pi_0(da_0\mid x,\lambda)\\
    &=\int_{E_X}\int_A \hat P^{1\pi}_{(x_1,\lambda-\hat r(x,Q_0,a_0),\Phi(x,\lambda,a_0,x_1,\lambda_1,Q_0))}\Big(\sum_{k=0}^{n}\hat r(X_k,\mu_k,A_k)\le\lambda_1\Big)\,Q_X(dx_1\mid x,Q_0,a_0)\,\pi_0(da_0\mid x,\lambda)\\
    &=\int_{E_X}\int_A U^{1\pi}_n\big(x_1,\lambda-\hat r(x,Q_0,a_0),\Phi(x,\lambda,a_0,x_1,\lambda_1,Q_0)\big)\,Q_X(dx_1\mid x,Q_0,a_0)\,\pi_0(da_0\mid x,\lambda)
    \end{aligned}\tag{3.18}
    $$

    is obtained, where $1\pi:=\{\pi_1,\pi_2,\ldots\}$ represents the 1-shift policy of $\pi$.

    On the other hand, by Eq (3.4), we have the following:

    $$
    \begin{aligned}
    F^\pi_{n+1}(x,\lambda)&=\int_{E_Y}P^\pi_{(x,y,\lambda)}\Big(\sum_{k=0}^{n+1}r(X_k,Y_k,A_k)\le\lambda\Big)Q_0(dy)\\
    &=\int_{E_Y}E^\pi_{(x,y,\lambda)}\Big[E^\pi_{(x,y,\lambda)}\big[I_{\{\sum_{k=0}^{n+1}r(X_k,Y_k,A_k)\le\lambda\}}\,\big|\,X_0,\Lambda_0,Y_0,A_0,X_1,\Lambda_1,Y_1\big]\Big]Q_0(dy)\\
    &=\int_{E_Y}\int_{E_X}\int_{\mathbb{R}}\int_A\int_{E_Y}E^\pi_{(x,y,\lambda)}\big[I_{\{\sum_{k=0}^{n+1}r(X_k,Y_k,A_k)\le\lambda\}}\,\big|\,X_0=x,\Lambda_0=\lambda,Y_0=y,A_0=a_0,X_1=x_1,\Lambda_1=\lambda_1,Y_1=y_1\big]\\
    &\qquad\times\Phi(x,\lambda,a_0,x_1,\lambda_1,Q_0)(dy_1)\,Q_X(dx_1\mid x,y,a_0)\,I_{\{\lambda-r(x,y,a_0)\}}(d\lambda_1)\,\pi_0(da_0\mid x,\lambda)\,Q_0(dy)\\
    &=\int_{E_Y}\int_{E_X}\int_{\mathbb{R}}\int_A\int_{E_Y}P^{1\pi}_{(x_1,y_1,\lambda-r(x,y,a_0))}\Big(\sum_{k=0}^{n}r(X_k,Y_k,A_k)\le\lambda-r(x,y,a_0)\Big)\Phi(x,\lambda,a_0,x_1,\lambda_1,Q_0)(dy_1)\\
    &\qquad\times Q_X(dx_1\mid x,y,a_0)\,I_{\{\lambda-r(x,y,a_0)\}}(d\lambda_1)\,\pi_0(da_0\mid x,\lambda)\,Q_0(dy)\\
    &=\int_{E_X}\int_A\int_{E_Y}F^{1\pi}_n\big(x_1,\lambda-r(x,y,a_0)\big)\,Q_X(dx_1\mid x,y,a_0)\,\pi_0(da_0\mid x,\lambda)\,Q_0(dy),
    \end{aligned}
    $$

    which, together with Eq (3.18) and the induction hypothesis, proves Eq (3.17) for $n=0,1,\ldots,N$, i.e., $F^\pi_n(x,\lambda)=U^\pi_n(x,\lambda,Q_0)$.

    In particular, for $n=N$ we have $F^\pi_N(x,\lambda)=U^\pi_N(x,\lambda,Q_0)$; since the policy $\pi$ is arbitrary, taking the infimum over $\pi\in\Pi$ yields $F_N(x,\lambda)=U_N(x,\lambda,Q_0)$.

    The establishment of the optimality equation requires the following theorem.

    Theorem 3.1. Suppose that Assumption 3.1 is satisfied. Then, for any $(x,\lambda,\mu)\in E\times\mathbb{R}$, $\pi=\{\pi_0,\pi_1,\ldots\}\in\Pi$ and $n\ge 0$, the following statement holds: $U^\pi_{n+1}(x,\lambda,\mu)=T_{\pi_0}U^{1\pi}_n(x,\lambda,\mu)$, where $U^\pi_0(x,\lambda,\mu)=I_{[0,+\infty)}(\lambda)$ and $1\pi:=\{\pi_1,\pi_2,\ldots\}$ represents the 1-shift policy of $\pi$.

    Proof. For any $(x,\lambda,\mu)\in E\times\mathbb{R}$, $\pi=\{\pi_0,\pi_1,\ldots\}\in\Pi$ and $n=0,1,\ldots,N-1$, by (3.8) and the properties of conditional expectation, we obtain the following:

    $$
    \begin{aligned}
    U^\pi_{n+1}(x,\lambda,\mu)&=\hat P^\pi_{(x,\lambda,\mu)}\Big(\sum_{k=0}^{n+1}\hat r(X_k,\mu_k,A_k)\le\lambda\Big)\\
    &=\hat E^\pi_{(x,\lambda,\mu)}\Big[\hat E^\pi_{(x,\lambda,\mu)}\big[I_{\{\sum_{k=0}^{n+1}\hat r(X_k,\mu_k,A_k)\le\lambda\}}\,\big|\,X_0,\Lambda_0,\mu_0,A_0,X_1,\Lambda_1,\mu_1\big]\Big]\\
    &=\int_A\int_{E_X}\int_{\mathbb{R}}\int_{\mathcal{P}(E_Y)}\hat P^\pi_{(x,\lambda,\mu)}\Big(\sum_{k=0}^{n+1}\hat r(X_k,\mu_k,A_k)\le\lambda\,\Big|\,X_0=x,\Lambda_0=\lambda,\mu_0=\mu,A_0=a,X_1=\hat x,\Lambda_1=\hat\lambda,\mu_1=\hat\mu\Big)\\
    &\qquad\times\hat Q(d\hat x,d\hat\lambda,d\hat\mu\mid x,\lambda,\mu,a)\,\pi_0(da\mid x,\lambda)\\
    &=\int_{E_X}\int_{\mathbb{R}}\int_A \hat P^\pi_{(x,\lambda,\mu)}\Big(\sum_{k=0}^{n+1}\hat r(X_k,\mu_k,A_k)\le\lambda\,\Big|\,X_0=x,\Lambda_0=\lambda,\mu_0=\mu,A_0=a,X_1=\hat x,\Lambda_1=\hat\lambda,\mu_1=\hat\mu\Big)\\
    &\qquad\times Q_X(d\hat x\mid x,\mu,a)\,I_{\{\lambda-\hat r(x,\mu,a)\}}(d\hat\lambda)\,\pi_0(da\mid x,\lambda)\\
    &=\int_{E_X}\int_A \hat P^{1\pi}_{(\hat x,\lambda-\hat r(x,\mu,a),\Phi(x,\lambda,a,\hat x,\lambda-\hat r(x,\mu,a),\mu))}\Big(\sum_{k=0}^{n}\hat r(X_k,\mu_k,A_k)\le\lambda-\hat r(x,\mu,a)\Big)\,Q_X(d\hat x\mid x,\mu,a)\,\pi_0(da\mid x,\lambda)\\
    &=\int_{E_X}\int_A U^{1\pi}_n\big(\hat x,\lambda-\hat r(x,\mu,a),\Phi(x,\lambda,a,\hat x,\lambda-\hat r(x,\mu,a),\mu)\big)\,Q_X(d\hat x\mid x,\mu,a)\,\pi_0(da\mid x,\lambda)\\
    &=:T_{\pi_0}U^{1\pi}_n(x,\lambda,\mu).
    \end{aligned}
    $$

    This completes the proof.

    Theorem 3.2. Suppose that Assumption 3.1 holds. Then, for each $(x,\lambda,\mu)\in E\times\mathbb{R}$:

    (a) $\{U_n,\ n=0,1,\ldots,N-1\}$ satisfies the corresponding optimality equation:

    $$U_0(x,\lambda,\mu):=I_{[0,+\infty)}(\lambda),\qquad U_{n+1}(x,\lambda,\mu)=TU_n(x,\lambda,\mu).$$

    (b) There exists a policy $g_n\in\Pi^{DM}$ such that $U_{n+1}=T_{g_n}U_n$ for $n=0,1,\ldots,N-1$. Then, the policy $\pi^*:=\{f_0,f_1,\ldots,f_{N-1}\}\in\Pi^{DH}$ is optimal, where $f_n(h_n):=g_{N-1-n}(x_n,\lambda_n,\mu_n)$ for each $h_n\in H_n$, $n=0,1,\ldots,N-1$.

    Proof. (a) According to Theorem 3.1 and (3.16), for each $(x,\lambda,\mu)\in E\times\mathbb{R}$ and $\pi=\{\pi_0,\pi_1,\ldots\}\in\Pi$, we have the following:

    $$U^\pi_{n+1}(x,\lambda,\mu)=T_{\pi_0}U^{1\pi}_n(x,\lambda,\mu)\ge T_{\pi_0}U_n(x,\lambda,\mu)\ge TU_n(x,\lambda,\mu). \tag{3.19}$$

    Since $\pi$ is arbitrary, this implies $U_{n+1}(x,\lambda,\mu)\ge TU_n(x,\lambda,\mu)$.

    To prove the reverse inequality, the following fact is needed: for any $(x,\lambda,\mu)\in E\times\mathbb{R}$ and $n\ge -1$, there is a policy $\eta\in\Pi^{DM}$ which satisfies $U_n(x,\lambda,\mu)=U^\eta_n(x,\lambda,\mu)$. Since $U_{-1}(x,\lambda,\mu)=U^\pi_{-1}(x,\lambda,\mu)=I_{[0,+\infty)}(\lambda)$ for any $\pi\in\Pi^{RM}$, this fact trivially holds for $n=-1$. Assume that there exists a policy $\zeta\in\Pi^{DM}$ that satisfies $U_k(x,\lambda,\mu)=U^\zeta_k(x,\lambda,\mu)$ for $n=k\ge -1$. On the other hand, since the set of actions is finite, there exists an $f\in F$ that satisfies $T_fU_k(x,\lambda,\mu)=TU_k(x,\lambda,\mu)$. Then, letting $\theta=\{f,\zeta\}\in\Pi^{DM}$, we know that

    $$U_{k+1}(x,\lambda,\mu)\le U^\theta_{k+1}(x,\lambda,\mu)=T_fU^\zeta_k(x,\lambda,\mu)=T_fU_k(x,\lambda,\mu)=TU_k(x,\lambda,\mu), \tag{3.20}$$

    where the first equality follows from Theorem 3.1 and the second equality follows from the induction hypothesis. Combining (3.19) and (3.20), we obtain $TU_k=U_{k+1}$ and $U_{k+1}=U^\theta_{k+1}$ with $\theta\in\Pi^{DM}$, which completes the induction and proves the assertion.

    (b) For each $(x,\lambda,\mu)\in E\times\mathbb{R}$, the existence of a policy $g_n\in\Pi^{DM}$ satisfying $U_{n+1}=T_{g_n}U_n$ for $n=0,1,\ldots,N-1$ follows from the finiteness of the action set. Let $\pi=\pi(n):=\{g_n,g_{n-1},\ldots,g_0\}\in\Pi$. When $n=0$, by Theorem 3.1, $U^\pi_1=T_{g_0}U^{1\pi}_0=T_{g_0}U_0=TU_0=U_1$. Assuming that $U_k=U^{1\pi}_k$ for $n=k$, by Theorem 3.1 and part (a),

    $$U^\pi_{k+1}=T_{g_k}U^{1\pi}_k=T_{g_k}U_k=TU_k=U_{k+1}.$$

    Thus, the induction is complete and $U^{\pi^*}_N=U_N$ for $\pi^*:=\{f_0,f_1,\ldots,f_{N-1}\}$ with $f_n(h_n):=g_{N-1-n}(x_n,\lambda_n,\mu_n(\cdot\mid h_n))$, $n=0,1,\ldots,N-1$. Hence, the policy $\pi^*$ is optimal.

    Based on Theorem 3.2, the value iteration algorithm is established as follows:

    The value iteration algorithm

    Step 1. Let $U_0(x,\lambda,\mu):=I_{[0,+\infty)}(\lambda)$ for $(x,\lambda,\mu)\in E\times\mathbb{R}$.

    Step 2. For $n=0,1,\ldots,N-1$, compute the value function $U_{n+1}$ as follows:

    $$T_aU_n(x,\lambda,\mu)=\int_{E_X}U_n\big(\hat x,\lambda-\hat r(x,\mu,a),\Phi(x,\lambda,a,\hat x,\lambda-\hat r(x,\mu,a),\mu)\big)\,Q_X(d\hat x\mid x,\mu,a),$$
    $$U_{n+1}(x,\lambda,\mu)=\min_{a\in A(x)}\{T_aU_n(x,\lambda,\mu)\}.$$

    Step 3. Find a decision rule $g_{N-1}$ satisfying $U_N=T_{g_{N-1}}U_{N-1}$. Then, by Theorem 3.2, the policy $\pi^*$ constructed from $\{g_0,g_1,\ldots,g_{N-1}\}$ is optimal.
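    The following Python sketch mirrors Steps 1–3 for a finite model. Instead of discretizing the whole space $E\times\mathbb{R}$, it recurses over the triples $(x,\lambda,\mu)$ that are actually reachable and caches them; `QX`, `q`, `reward` and `actions` are hypothetical placeholders in the same layout as the earlier sketches, and the belief update again omits the reward-consistency indicator of (3.1) for simplicity:

```python
from functools import lru_cache

def make_value_iteration(EX, EY, actions, QX, q, reward):
    """Finite-model sketch of the value iteration algorithm (Steps 1-3).

    Hypothetical data layout (placeholders, not from the paper):
      actions[x]                   -> iterable of admissible actions A(x)
      QX[(x_next, x, y, a)]        -> marginal transition prob Q_X(x' | x, y, a)
      q[(x_next, y_next, x, y, a)] -> joint transition prob q(x', y' | x, y, a)
      reward[(x, y, a)]            -> reward r(x, y, a)
    Beliefs mu are tuples ordered like EY so they can be cached.
    """
    def r_hat(x, mu, a):              # filtered reward hat_r(x, mu, a)
        return sum(reward[(x, y, a)] * p for y, p in zip(EY, mu))

    def qx_mixed(x2, x, mu, a):       # Q_X(x' | x, mu, a)
        return sum(QX[(x2, x, y, a)] * p for y, p in zip(EY, mu))

    def belief_update(x, a, x2, mu):  # simplified filter step (no reward indicator)
        w = [sum(q[(x2, y2, x, y, a)] * p for y, p in zip(EY, mu)) for y2 in EY]
        s = sum(w)
        return tuple(v / s for v in w) if s > 0 else mu

    @lru_cache(maxsize=None)
    def U(n, x, lam, mu):
        if n == 0:                    # Step 1: U_0 = I_{[0, +infinity)}(lambda)
            return 1.0 if lam >= 0 else 0.0
        best = None                   # Step 2: U_n = min_a T_a U_{n-1}
        for a in actions[x]:
            lam2 = lam - r_hat(x, mu, a)
            val = sum(
                qx_mixed(x2, x, mu, a) * U(n - 1, x2, lam2, belief_update(x, a, x2, mu))
                for x2 in EX
            )
            if best is None or val < best:
                best = val
        return best

    def greedy_action(n, x, lam, mu):  # Step 3: an action attaining the minimum at stage n
        return min(
            actions[x],
            key=lambda a: sum(
                qx_mixed(x2, x, mu, a)
                * U(n - 1, x2, lam - r_hat(x, mu, a), belief_update(x, a, x2, mu))
                for x2 in EX
            ),
        )

    return U, greedy_action
```

    Here `U(N, x, lam, mu)` plays the role of $U_N(x,\lambda,\mu)$ and the greedy minimizers play the role of the decision rules $g_n$; since the belief update deliberately drops the reward-consistency indicator of (3.1), it is only a simplified stand-in for the exact filter.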

    An illustrative example is provided to show how the value function and an optimal policy are calculated and to demonstrate the effectiveness and feasibility of the value iteration algorithm.

    Example 4.1. Consider a machine production process with two observable product quality states (i.e., nonconforming product 0 and qualified product 1) and two unobservable machine operation states (i.e., poor state 1 and good state 2). At the initial time $n=0$, given the observed product quality $x=1$ and the reward goal $\lambda$, when the production process is in the unobservable state $y\in\{1,2\}$, the decision maker can select either an ordinary maintenance action $a_{11}$ or an advanced maintenance action $a_{12}$, receiving a reward $r(x,y,a)$. If the product quality is $x=0$, the decision maker must select the advanced maintenance action $a_{01}$. When the maintenance action $a$ is applied, the system moves to the state $(x',y')$ at the next moment with probability $Q(\cdot,\cdot\mid x,y,a)$. The general objective of the decision maker is to select actions that minimize the probability that the total rewards over the horizon from $0$ to $N=15$ do not exceed the target $\lambda$.

    This evolution process can be formulated as a discrete-time POMDP with state space $E_X\times E_Y=\{0,1\}\times\{1,2\}$ and admissible action sets $A(0)=\{a_{01}\}$, $A(1)=\{a_{11},a_{12}\}$. Assume that the transition probabilities are given by $Q(\cdot,\cdot\mid x,y,a)=Q_X(\cdot\mid x,y,a)\,p(\cdot\mid y)$, where the probabilities $Q_X(\cdot\mid x,y,a)$ are given by the following:

    $$
    \begin{aligned}
    &Q_X(0\mid 0,1,a_{01})=1, && Q_X(1\mid 0,1,a_{01})=0, && Q_X(0\mid 0,2,a_{01})=1, && Q_X(1\mid 0,2,a_{01})=0,\\
    &Q_X(0\mid 1,1,a_{11})=0.5, && Q_X(1\mid 1,1,a_{11})=0.5, && Q_X(0\mid 1,1,a_{12})=0.3, && Q_X(1\mid 1,1,a_{12})=0.7,\\
    &Q_X(0\mid 1,2,a_{11})=0.4, && Q_X(1\mid 1,2,a_{11})=0.6, && Q_X(0\mid 1,2,a_{12})=0.2, && Q_X(1\mid 1,2,a_{12})=0.8.
    \end{aligned}\tag{4.1}
    $$

    The transition probabilities of the unobserved state are given by $p(2\mid 2)=1-p(1\mid 2)=0.7$ and $p(1\mid 1)=1$. The reward rates are given as follows:

    $$r(0,1,a_{01})=r(0,2,a_{01})=0,\quad r(1,1,a_{11})=2,\quad r(1,1,a_{12})=4,\quad r(1,2,a_{11})=1,\quad r(1,2,a_{12})=3.$$

    Our main goal is to use the value iteration algorithm to compute the value function and the optimal policies.
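    As a small worked instance of the filtered reward $\hat r$ for this example (using the reward rates above and an illustrative belief $\mu=(\mu(1),\mu(2))$ about the machine state, which is not taken from the paper's computations):

    $$\hat r(1,\mu,a_{11})=2\,\mu(1)+1\,\mu(2),\qquad \hat r(1,\mu,a_{12})=4\,\mu(1)+3\,\mu(2),\qquad \hat r(0,\mu,a_{01})=0.$$

    For instance, with the illustrative belief $\mu=(0.5,0.5)$ these values are $1.5$ and $3.5$, so in the filtered model one step of ordinary or advanced maintenance lowers the remaining goal $\lambda$ by $1.5$ or $3.5$, respectively.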

    First, according to (3.13), since $r(0,1,a_{01})=r(0,2,a_{01})=0$, it is known that $U_N(0,\lambda,\mu)=I_{[0,+\infty)}(\lambda)$. Based on the value iteration algorithm above and Matlab software, the curves of the functions $T_{a_{11}}U_N(1,\lambda,\mu)$, $T_{a_{12}}U_N(1,\lambda,\mu)$ and the approximated value function $U_N(1,\lambda,\mu)$ are plotted (see Figures 1 and 2). By observing the figures, the following conclusions are obtained:

    Figure 1.  The function $T_aU_N(1,\lambda,\mu)$ for $a\in\{a_{11},a_{12}\}$.
    Figure 2.  The value function $U_N(1,\lambda,\mu)$.

    (a) As seen in Figure 1, when $x=1$ and $\lambda\in(0,4)$, the value $T_{a_{12}}U_N(1,\lambda,\mu)$ is less than $T_{a_{11}}U_N(1,\lambda,\mu)$; if instead $\lambda\in[4,+\infty)$, the value $T_{a_{11}}U_N(1,\lambda,\mu)$ is less than $T_{a_{12}}U_N(1,\lambda,\mu)$. Hence, when the observable state is $x=1$ and $\lambda\in(0,4)$, the decision maker should choose the lower-risk action $a_{12}$; conversely, if $\lambda\in[4,+\infty)$, the decision maker should choose the lower-risk action $a_{11}$ instead of $a_{12}$.

    (b) Based on Figure 1, the risk probability optimal policy for the POMDP at times $n=0,1,\ldots,N$ is given by the following:

    $$f(1,\lambda)=\begin{cases}a_{12}, & 0\le\lambda<4;\\ a_{11}, & \lambda\ge 4.\end{cases}\tag{4.2}$$

    In this paper, we studied the problem of minimizing the risk probability criterion for finite horizon partially observable discrete-time Markov decision processes (POMDPs). Unlike the classical expectation criterion, the reward levels are regarded as a component of an extended state; accordingly, we redefined history-dependent policies and reconstructed a new probability measure. Based on the Bayes operator and the filter equations we constructed, the risk probability optimization problem was equivalently reformulated as a filtered Markov decision process. We proposed a value iteration algorithm, established the existence of a solution to the optimality equation, and proved the existence of a risk probability optimal policy.

    The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.

    This work was supported by Guangxi science and technology base and talent project (Grant No. AD21159005); Foundation of Guangxi Educational Committee (Grant No. KY2022KY0342); National Natural Science Foundation of China (Grant No. 11961005, 12361091); Guangxi Natural Science Foundation Program (Grant No. 2020GXNSFAA297196); The Doctoral Foundation of Guangxi University of Science and Technology (Grant No. 18Z06).

    The authors declare that there is no conflict of interests regarding the publication of this paper.



    [1] N. Bäuerle, U. Rieder, Markov decision processes with applications to finance, Heidelberg: Springer, 2011. https://doi.org/10.1007/978-3-642-18324-9
    [2] J. Janssen, R. Manca, Semi-Markov risk models for finance, insurance and reliability, New York: Springer, 2006. https://doi.org/10.1007/0-387-70730-1
    [3] X. P. Guo, O. Hernández-Lerma, Continuous-time Markov decision processes: Theory and applications, Berlin: Springer-Verlag, 2009. https://doi.org/10.1007/978-3-642-02547-1
    [4] M. J. Sobel, The variance of discounted Markov decision processes, J. Appl. Probab., 19 (1982), 794–802. https://doi.org/10.1017/s0021900200023123 doi: 10.1017/s0021900200023123
    [5] Y. Ohtsubo, K. Toyonaga, Optimal policy for minimizing risk models in Markov decision processes, J. Math. Anal. Appl., 271 (2002), 66–81. https://doi.org/10.1016/s0022-247x(02)00097-5 doi: 10.1016/s0022-247x(02)00097-5
    [6] D. J. White, Minimizing a threshold probability in discounted Markov decision processes, J. Math. Anal. Appl., 173 (1993), 634–646. https://doi.org/10.1006/jmaa.1993.1093 doi: 10.1006/jmaa.1993.1093
    [7] C. B. Wu, Y. L. Lin, Minimizing risk models in Markov decision processes with policies depending on target values, J. Math. Anal. Appl., 231 (1999), 47–67. https://doi.org/10.1006/jmaa.1998.6203 doi: 10.1006/jmaa.1998.6203
    [8] X. Wu, X. P. Guo, First passage optimality and variance minimization of Markov decision processes with varying discount factors, J. Appl. Probab., 52 (2015), 441–456. https://doi.org/10.1017/S0021900200012560 doi: 10.1017/S0021900200012560
    [9] Y. H. Huang, X. P. Guo, Optimal risk probability for first passage models in Semi-Markov processes, J. Math. Anal. Appl., 359 (2009), 404–420. https://doi.org/10.1016/j.jmaa.2009.05.058 doi: 10.1016/j.jmaa.2009.05.058
    [10] Y. H. Huang, X. P. Guo, Z. F. Li, Minimum risk probability for finite horizon semi-Markov decision processes, J. Math. Anal. Appl., 402 (2013), 378–391. https://doi.org/10.1016/j.jmaa.2013.01.021 doi: 10.1016/j.jmaa.2013.01.021
    [11] X. X. Huang, X. L. Zou, X. P. Guo, A minimization problem of the risk probability in first passage semi-Markov decision processes with loss rates, Sci. China Math., 58 (2015), 1923–1938. https://doi.org/10.1007/s11425-015-5029-x doi: 10.1007/s11425-015-5029-x
    [12] H. F. Huo, X. L. Zou, X. P. Guo, The risk probability criterion for discounted continuous-time Markov decision processes, Discrete Event Dyn. syst., 27 (2017), 675–699. https://doi.org/10.1007/s10626-017-0257-6 doi: 10.1007/s10626-017-0257-6
    [13] H. F. Huo, X. Wen, First passage risk probability optimality for continuous time Markov decision processes, Kybernetika, 55 (2019), 114–133. https://doi.org/10.14736/kyb-2019-1-0114 doi: 10.14736/kyb-2019-1-0114
    [14] H. F. Huo, X. P. Guo, Risk probability minimization problems for continuous time Markov decision processes on finite horizon, IEEE T. Automat. Contr., 65 (2020), 3199–3206. https://doi.org/10.1109/tac.2019.2947654 doi: 10.1109/tac.2019.2947654
    [15] X. Wen, H. F. Huo, X. P. Guo, First passage risk probability minimization for piecewise deterministic Markov decision processes, Acta Math. Appl. Sin. Engl. Ser., 38 (2022), 549–567. https://doi.org/10.1007/s10255-022-1098-0 doi: 10.1007/s10255-022-1098-0
    [16] A. Drake, Observation of a Markov process through a noisy channel, Massachusetts Institute of Technology, 1962.
    [17] K. Hinderer, Foundations of non-stationary dynamic programming with discrete time parameter, Berlin: Springer-Verlag, 1970.
    [18] D. Rhenius, Incomplete information in Markovian decision models, Ann. Statist., 26 (1974), 1327–1334. https://doi.org/10.1214/aos/1176342886 doi: 10.1214/aos/1176342886
    [19] O. Hernández-Lerma, Adaptive Markov control processes, New York: Springer-Verlag, 1989. https://doi.org/10.1007/978-1-4419-8714-3
    [20] R. D. Smallwood, E. J. Sondik, The optimal control of partially observable Markov processes over a finite horizon, Oper. Res., 21 (1973), 1071–1088. https://doi.org/10.1287/opre.21.5.1071 doi: 10.1287/opre.21.5.1071
    [21] K. Sawaki, A. Ichikawa, Optimal control for partially observable Markov decision processes over an infinite horizon, J. Oper. Res. Soc. JPN, 21 (1978), 1–16. https://doi.org/10.15807/jorsj.21.1 doi: 10.15807/jorsj.21.1
    [22] C. C. White, W. T. Scherer, Finite memory suboptimal design for partially observed Markov decision processes, Oper. Res., 42 (1994), 439–455. https://doi.org/10.1287/opre.42.3.439 doi: 10.1287/opre.42.3.439
    [23] E. A. Feinberg, P. O. Kasyanov, M. Z. Zgurovsky, Partially observable total cost Markov decision processes with weakly continuous transition probabilities, Math. Oper. Res., 41 (2016), 656–681. https://doi.org/10.1287/moor.2015.0746 doi: 10.1287/moor.2015.0746
    [24] M. Haklidir, H. Temeltas, Guided soft actor critic: A guided deep reinforcement learning approach for partially observable Markov decision processes, IEEE Access, 9 (2021), 159672–159683. https://doi.org/10.1109/access.2021.3131772 doi: 10.1109/access.2021.3131772
    [25] D. Bertsekas, S. Shreve, Stochastic optimal control: The discrete-time case, Athena Scientific, 1996.
  • © 2023 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)