Research article

The optimal probability of the risk for finite horizon partially observable Markov decision processes

  • Received: 31 May 2023 Revised: 06 September 2023 Accepted: 21 September 2023 Published: 18 October 2023
  • MSC : 60J27, 90C40

  • This paper investigates the optimality of the risk probability criterion for finite horizon partially observable discrete-time Markov decision processes (POMDPs). The risk probability under consideration is the probability that the total rewards do not exceed a preset goal value, which differs from the classical expected-reward optimization problem. Based on the Bayes operator and the filter equations, the risk probability optimization problem is equivalently reformulated as a filtered Markov decision process. By developing a value iteration technique, the optimality equation satisfied by the value function is established and the existence of a risk probability optimal policy is proven. Finally, an example illustrates the effectiveness of the value iteration algorithm for computing the value function and an optimal policy.

    Citation: Xian Wen, Haifeng Huo, Jinhua Cui. The optimal probability of the risk for finite horizon partially observable Markov decision processes[J]. AIMS Mathematics, 2023, 8(12): 28435-28449. doi: 10.3934/math.20231455




    Analyzing the risk performance of a stochastic dynamic system is an important optimal control problem, with both theoretical and applied relevance in areas such as finance and insurance [1], communication networks [2] and queueing systems [3]. Since the conventional expectation criterion cannot effectively reflect the risk performance of a system, risk probability criteria were first proposed by Sobel [4] and implemented in Markov decision processes (MDPs). Afterwards, many scholars focused on risk probability optimization problems in MDPs. According to the characteristics of the sojourn time of the system state, the existing studies can be roughly divided into four categories: (i) discrete-time Markov decision processes (DTMDPs) [5,6,7,8]; (ii) semi-Markov decision processes (SMDPs) [9,10,11]; (iii) continuous-time Markov decision processes (CTMDPs) [12,13,14]; and (iv) piecewise deterministic Markov decision processes (PDMDPs) [15]. A common feature of this existing literature is that the system state is completely observable. However, in practical applications such as machine maintenance and finance, the traditional MDP models cannot effectively describe these problems because the information of the decision environment cannot be completely observed or perceived. Therefore, it is necessary to establish partially observable Markov decision processes (POMDPs) to optimize the risk probability of the control system.

    Compared with completely observable Markov decision processes (COMDPs), the POMDP model is a more general stochastic control model with important theoretical significance and practical value, and it is widely used in fields such as industry, computational science, finance and artificial intelligence. Therefore, many scholars have focused on expected-reward optimality problems for POMDPs. More specifically, Drake [16] established the POMDP model, which attracted the attention of many experts and scholars. Regarding the expected optimality problem, Hinderer [17] discussed the finite state case, while Rhenius [18] and Hernández-Lerma [19] discussed more general state spaces. Smallwood and Sondik [20] developed a dynamic programming algorithm to compute optimal strategies and the value function. Sawaki and Ichikawa [21] proposed a successive approximation method to calculate an optimal strategy and the value function. White and Scherer [22] solved the infinite-horizon discounted optimization problem by modifying the reward function and employing an iterative approximation algorithm. Bäuerle and Rieder [1] established the optimality equation by equivalently converting POMDPs into a filtered MDP model and proved the existence of an optimal strategy. Feinberg et al. [23] established sufficient conditions ensuring the existence of optimal strategies and an optimality equation for more general state and action spaces. In addition, many scholars have focused on computational algorithms for POMDPs [24]. However, these criteria mainly address the expected value of the total rewards, which cannot effectively describe the risk faced by the control system. Therefore, it is necessary to introduce risk probability criteria that capture the risk performance of the system. An overview of the existing literature indicates that the risk probability criterion for POMDPs has not been studied thus far; this paper is the first attempt to solve the risk probability optimization problem for POMDPs.

    The optimization problem is to minimize the risk probability criterion, that is, the probability that the system's total rewards do not exceed the profit goal. Because the reward level is regarded as the second component of an extended state, it is necessary to redescribe the evolution of the system and to redefine the history-dependent, Markov and stationary policies. Thus, for any given redefined policy, a new probability space must be constructed using the well-known Ionescu Tulcea theorem (see, e.g., Proposition 7.45 in [25]), based on any initial system state and reward goal. Second, the conditional probability distribution of the unobservable state is constructed by redefining the Bayes operator (including the reward levels) and establishing the filter equations. Then, based on this conditional probability distribution, a new filtered risk probability MDP model is established by expanding the state and action spaces and modifying the transition kernel and the reward function. Furthermore, we prove that the new filtered MDP model reveals the relationship between the partially and completely observable optimization problems. Building on the risk probability optimality theory for COMDPs and using the value iteration technique, the optimality equation is established and the existence of optimal policies is proven. Finally, a machine maintenance example is given to illustrate our main results, including the use of the value iteration algorithm to calculate the value function and an optimal policy.

    The rest of the manuscript is organized as follows. Section 2 formulates the risk probability minimization problem for POMDPs. Section 3 presents the main results, including the existence of a solution to the optimality equation and of optimal policies. An example illustrating the value iteration algorithm for calculating the value function and an optimal policy is given in Section 4.

    The model of POMDPs consists of the following elements:

    $$\{E_X\times E_Y,\ \{A(x)\subset A,\ x\in E_X\},\ Q(\cdot\mid x,y,a),\ r(x,y,a),\ Q_0\} \tag{2.1}$$

    which have the following meanings:

    (a) $E_X\times E_Y$ represents a Borel space with a Borel $\sigma$-algebra $\mathcal{B}(E_X\times E_Y)$. The element $(x,y)\in E_X\times E_Y$ is the system state, where $x$ denotes the observable portion of the state and $y$ denotes the unobservable portion.

    (b) $A$ represents the action space, a Borel space with a Borel $\sigma$-algebra $\mathcal{A}$. $A(x)\subset A$ represents the set of admissible actions in state $x\in E_X$, which is assumed to be finite. Moreover, the set of all admissible state-action pairs is denoted by $K:=\{(x,a)\mid x\in E_X,\ a\in A(x)\}$.

    (c) $Q(\cdot\mid x,y,a)$ denotes the transition probability from $E_X\times E_Y\times A$ to $E_X\times E_Y$, which describes the transition mechanism of the controlled state process. For simplicity, we write $Q_X$ for the marginal transition probability $Q_X(B\mid x,y,a):=Q(B\times E_Y\mid x,y,a)$.

    (d) $r(x,y,a)$ denotes a nonnegative real-valued measurable reward function from $K\times E_Y$ to $\mathbb{R}_+:=[0,+\infty)$.

    (e) $Q_0$ denotes the initial probability distribution of the unobservable state.

    The evolution of the risk probability POMDP is characterized as follows. At time $s_0=0$, based on the observed state $x_0$ and the reward goal (reward level) $\tilde\lambda_0:=\lambda_0$, the decision maker picks an action $a_0$ from the set of admissible actions $A(x_0)$. The observed state then remains unchanged until time $s_1=1$, at which point the system state transfers to a state $x_1\in B_1\subset E_X$ with probability $\int_{E_Y}\int_{B_1\times E_Y}Q(dx_1,dy_1\mid x_0,y_0,a_0)\,Q_0(dy_0)$. Meanwhile, the unobservable state $y_0$ also transfers to the next state $y_1$ with a probability constructed by the Bayes operator in (3.3) below. During this period, the control system generates the reward $r(x_0,y_0,a_0)$, and the corresponding reward goal becomes $\tilde\lambda_1=\lambda_0-r(x_0,y_0,a_0)$. At the new decision time $s_1=1$, based on the observable information of the system $h_1=(x_0,\lambda_0,a_0,x_1,\tilde\lambda_1)$, the decision maker picks a new action $a_1\in A(x_1)$. Afterwards, the system evolves similarly and produces the observable history up to time $s_k=k$:

    $$h_k:=(x_0,\lambda_0,a_0,x_1,\tilde\lambda_1,a_1,\ldots,x_k,\tilde\lambda_k), \tag{2.2}$$

    where $x_k$ and $y_k$ denote the observable and unobservable parts of the system state at the $k$-th decision epoch, respectively, $a_k$ represents the action chosen by the decision maker at time $s_k=k$, and $\tilde\lambda_k$ denotes the remaining reward goal, meaning that the decision maker seeks to minimize the probability that the total rewards do not exceed this goal; it satisfies the following relation:

    $$\tilde\lambda_{k+1}:=\tilde\lambda_k-r(x_k,y_k,a_k), \tag{2.3}$$

    for $k=0,1,\ldots$.
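    As a purely numerical illustration of recursion (2.3) (these numbers are not taken from the model in Section 4), suppose the initial goal is $\lambda_0=5$ and the first two rewards are $r(x_0,y_0,a_0)=2$ and $r(x_1,y_1,a_1)=1$. Then

    $$\tilde\lambda_1=5-2=3,\qquad \tilde\lambda_2=3-1=2,$$

    so requiring the total rewards not to exceed $\lambda_0$ is the same as requiring the rewards still to come not to exceed the current level $\tilde\lambda_k$; this is exactly why the reward level is carried along as a component of the extended state.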

    The sets of all observable histories $h_k$ are given by $H_0:=E_X\times\mathbb{R}$ and $H_k:=H_{k-1}\times A\times E_X\times\mathbb{R}$ for $k\ge 1$. Based on these observable histories, the following policies are introduced.

    Definition 2.1. (a) A sequence $\pi=\{\pi_k,\ k\ge 0\}$ of stochastic kernels $\pi_k: H_k\to A$ is said to be a randomized history-dependent policy if

    $$\pi_k(A(x_k)\mid h_k)=1 \quad\text{for all } h_k\in H_k,\ k=0,1,2,\ldots.$$

    (b) A randomized history-dependent policy is said to be deterministic if there exists a sequence $\{g_k\}$ of measurable functions $g_k$ from $H_k$ to $A$ with $g_k(h_k)\in A(x_k)$ such that $\pi_k(\cdot\mid h_k)$ is the Dirac measure at $g_k(h_k)$ for all $h_k\in H_k$, $k\ge 0$. The sets of all randomized and deterministic history-dependent policies are denoted by $\Pi$ and $\Pi^{DH}$, respectively.

    The risk probability POMDP must keep track of both the system state and the reward goal, which differs from the conventional expectation MDP that tracks only the system state; thus, the results of the classical expectation MDP cannot be applied directly to the proposed model. First, we reconstruct the measurable space $(\Omega,\mathcal{F})$ as follows: the sample space is $\Omega:=\{(x_0,y_0,\lambda_0,a_0,x_1,y_1,\lambda_1,a_1,\ldots,x_k,y_k,\lambda_k,a_k,\ldots)\mid x_0\in E_X,\ y_0\in E_Y,\ \lambda_0\in\mathbb{R},\ a_0\in A,\ x_l\in E_X,\ y_l\in E_Y,\ \lambda_l\in\mathbb{R},\ a_l\in A \text{ with } 1\le l\le k,\ k\ge 1\}$, endowed with the Borel $\sigma$-algebra $\mathcal{F}$. For any $\omega:=(x_0,y_0,\lambda_0,a_0,x_1,y_1,\lambda_1,a_1,\ldots,x_k,y_k,\lambda_k,a_k,\ldots)\in\Omega$, some random variables are defined as follows:

    $$X_k(\omega):=x_k,\quad Y_k(\omega):=y_k,\quad \Lambda_k(\omega):=\lambda_k,\quad A_k(\omega):=a_k,\quad k\ge 0.$$

    The ω will be omitted for convenience.

    For any policy $\pi\in\Pi$, $(x,\lambda)\in E_X\times\mathbb{R}$ and initial distribution $Q_0$ of $Y_0$ on $E_Y$, the Ionescu Tulcea theorem (e.g., Proposition 7.45 in [25]) yields a unique probability measure $P^\pi_{(x,\lambda)}(\cdot)=\int_{E_Y}P^\pi_{(x,\lambda,y)}(\cdot)\,Q_0(dy)$ on $(\Omega,\mathcal{F})$, constructed as follows: for all $B\in\mathcal{B}(E_X)$, $C\in\mathcal{B}(E_Y)$, $G\in\mathcal{B}(\mathbb{R})$, $D\in\mathcal{B}(A)$, $h_k\in H_k$, $k=0,1,\ldots$,

    $$P^\pi_{(x,\lambda,y)}(X_0=x,\Lambda_0=\lambda)=1, \tag{2.4}$$
    $$P^\pi_{(x,\lambda,y)}(A_k\in D\mid h_k)=\int_D\pi_k(da_k\mid h_k), \tag{2.5}$$
    $$P^\pi_{(x,\lambda,y)}(X_{k+1}\in B,\,Y_{k+1}\in C,\,\Lambda_{k+1}\in G\mid h_k,y_k,a_k)=\int_{B\times C}\int_G Q(dx_{k+1},dy_{k+1}\mid x_k,y_k,a_k)\,I_{\{\lambda_k-r(x_k,y_k,a_k)\}}(d\lambda_{k+1}). \tag{2.6}$$

    The expectation operator corresponding to the probability measure $P^\pi_{(x,\lambda)}$ is denoted by $E^\pi_{(x,\lambda)}$.

    For any $(x,\lambda)\in E_X\times\mathbb{R}$ and $\pi\in\Pi$, the risk probability criterion of the POMDP is defined as follows:

    $$F^\pi_N(x,\lambda)=\int_{E_Y}P^\pi_{(x,y,\lambda)}\Big(\sum_{n=0}^{N}r(X_n,Y_n,A_n)\le\lambda\Big)\,Q_0(dy). \tag{2.7}$$

    Then, $F_N(x,\lambda):=\inf_{\pi\in\Pi}F^\pi_N(x,\lambda)$ is known as the risk probability value function.
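    As a purely illustrative aid (not part of the paper's development), the criterion (2.7) can be approximated for a fixed policy by straightforward simulation when all spaces are finite. The sketch below uses hypothetical placeholder data structures (`Q`, `Q0`, `reward` and `policy` would have to be supplied by the reader); it simulates trajectories of length $N+1$ and counts how often the accumulated reward does not exceed the goal $\lambda$:

```python
import random

def estimate_risk_probability(x0, lam, N, Q, Q0, reward, policy, num_sims=10000):
    """Monte Carlo estimate of F_N^pi(x0, lam) = P(sum of rewards <= lam).

    Hypothetical finite-model interface (placeholders, not from the paper):
      Q[(x, y, a)]       -> dict {(x_next, y_next): prob}, joint transition law
      Q0                 -> dict {y: prob}, initial law of the unobservable state
      reward[(x, y, a)]  -> float, reward r(x, y, a)
      policy(x, lam_rem) -> action, a fixed decision rule
    """
    hits = 0
    for _ in range(num_sims):
        # draw the unobservable initial state from Q0
        y = random.choices(list(Q0), weights=list(Q0.values()))[0]
        x, lam_rem, total = x0, lam, 0.0
        for _ in range(N + 1):
            a = policy(x, lam_rem)
            total += reward[(x, y, a)]
            lam_rem -= reward[(x, y, a)]   # reward-goal recursion (2.3)
            nxt = Q[(x, y, a)]
            x, y = random.choices(list(nxt), weights=list(nxt.values()))[0]
        if total <= lam:                   # the "risk" event in (2.7)
            hits += 1
    return hits / num_sims
```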

    Definition 2.2. A policy $\pi^*\in\Pi$ is called risk probability optimal if

    $$F^{\pi^*}_N(x,\lambda)=F_N(x,\lambda) \quad\text{for all } (x,\lambda)\in E_X\times\mathbb{R}. \tag{2.8}$$

    The general objective of this manuscript is to optimize the risk probability criterion, establish the optimality equation satisfied by the value function, and give conditions for the existence of an optimal policy.

    Notation: Let $\mathcal{P}(E_Y)$ be the space of all probability measures on $E_Y$.

    Assumption 3.1. The transition probability $Q$ has a density function $q$ satisfying $Q(d(x',y')\mid x,y,a)=q(x',y'\mid x,y,a)\,\rho(dx')\,\nu(dy')$ for some $\sigma$-finite measures $\rho$ and $\nu$.

    The Bayes operator $\Phi: E_X\times\mathbb{R}\times A\times E_X\times\mathbb{R}\times\mathcal{P}(E_Y)\to\mathcal{P}(E_Y)$ is first defined as follows:

    $$\Phi(x,\lambda,a,x',\lambda',\mu)(C):=\frac{\int_C\int_{E_Y} q(x',y'\mid x,y,a)\, I_{\{\lambda-r(x,y,a)\}}(\lambda')\,\mu(dy)\,\nu(dy')}{\int_{E_Y}\int_{E_Y} q(x',y'\mid x,y,a)\,\mu(dy)\,\nu(dy')}, \tag{3.1}$$

    where $C\in\mathcal{B}(E_Y)$ and $\mu$ denotes the distribution of the unobservable state. Furthermore, by iteration, for any $h_k\in H_k$, $k=0,1,\ldots$, the conditional probability distribution $\mu_k$ of the unobservable variable $Y_k$ is given by the following:

    $$\mu_0(C\mid h_0):=Q_0(C), \tag{3.2}$$
    $$\mu_{k+1}(C\mid h_k,a,x',\lambda'):=\Phi(x_k,\lambda_k,a,x',\lambda',\mu_k(\cdot\mid h_k))(C), \tag{3.3}$$

    which are called filter equations.
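    When $E_Y$ is finite, the Bayes operator (3.1) and the filter equations (3.2) and (3.3) reduce to reweighting and renormalizing the current belief vector. The following is a minimal sketch under that finiteness assumption; the dictionaries `q` and `reward` and the state/action labels are hypothetical placeholders, the Dirac-type indicator in (3.1) is implemented as an exact-equality check on the observed goal decrement, and, as a simplification relative to the displayed denominator in (3.1), the update is normalized by the total numerator mass so that it returns a probability vector:

```python
def bayes_update(x, lam, a, x_next, lam_next, mu, q, reward, EY, tol=1e-12):
    """Finite-space sketch of the Bayes operator Phi in (3.1) / filter step (3.3).

    mu                           : dict {y: prob}, current belief over the unobservable state
    q[(x_next, y_next, x, y, a)] : transition density q(x', y' | x, y, a) (placeholder layout)
    reward[(x, y, a)]            : reward r(x, y, a)
    Returns the updated belief over y' in EY, normalized by the total numerator mass.
    """
    new_mu = {}
    for y2 in EY:
        # numerator of (3.1): sum over y, keeping only those y consistent with the
        # observed goal decrement lam_next = lam - r(x, y, a)
        new_mu[y2] = sum(
            q[(x_next, y2, x, y, a)] * mu[y]
            for y in EY
            if abs(lam_next - (lam - reward[(x, y, a)])) < tol
        )
    total = sum(new_mu.values())
    if total < tol:
        return dict(mu)   # degenerate case: keep the previous belief
    return {y2: w / total for y2, w in new_mu.items()}


# Filter equations (3.2)-(3.3): start from mu_0 = Q0, then apply the operator repeatedly.
def filter_trajectory(Q0, transitions, q, reward, EY):
    """transitions: iterable of tuples (x_k, lam_k, a_k, x_next, lam_next)."""
    mu = dict(Q0)
    beliefs = [mu]
    for (x, lam, a, x_next, lam_next) in transitions:
        mu = bayes_update(x, lam, a, x_next, lam_next, mu, q, reward, EY)
        beliefs.append(mu)
    return beliefs
```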

    Lemma 3.1. Under Assumption 3.1, for any $\pi\in\Pi$ and $B\in\mathcal{B}(E_Y)$, the following statement holds:

    $$P^\pi_{(x,\lambda)}(Y_n\in B\mid X_0,\Lambda_0,A_0,\ldots,X_n,\Lambda_n)=\mu_n(B\mid X_0,\Lambda_0,A_0,\ldots,X_n,\Lambda_n). \tag{3.4}$$

    Proof. For each $\pi\in\Pi$ and $x\in E_X$, $\lambda\in\mathbb{R}$, the following result is proven by induction:

    $$E^\pi_{(x,\lambda)}[V(X_0,\Lambda_0,A_0,\ldots,X_n,\Lambda_n,Y_n)]=E^\pi_{(x,\lambda)}[\hat V(X_0,\Lambda_0,A_0,\ldots,X_n,\Lambda_n)], \tag{3.5}$$

    for every bounded and measurable function $V: H_n\times E_Y\to\mathbb{R}$, where $\hat V(h_n)=\int V(h_n,y_n)\,\mu_n(dy_n\mid h_n)$. Since $\hat V(h_0)=\int V(h_0,y)\,Q_0(dy)$, fact (3.5) is true when $n=0$. For any given $h_{n-1}\in H_{n-1}$, $n\ge 1$, suppose that (3.5) holds for $k=n-1$. Using (2.6), the definition of the Bayes operator and Fubini's theorem, we obtain the following:

    $$
    \begin{aligned}
    &E^\pi_{(x,\lambda)}[\hat V(h_{n-1},A_{n-1},X_n,\Lambda_n)]\\
    &=\int_{E_Y}\mu_{n-1}(dy_{n-1}\mid h_{n-1})\int_{E_X}\int_A Q_X(dx_n\mid x_{n-1},y_{n-1},a_{n-1})\int_{\mathbb{R}} I_{\{\lambda_{n-1}-r(x_{n-1},y_{n-1},a_{n-1})\}}(d\lambda_n)\,\hat V(h_{n-1},a_{n-1},x_n,\lambda_n)\,\pi_{n-1}(da_{n-1}\mid h_{n-1})\\
    &=\int_{E_Y}\mu_{n-1}(dy_{n-1}\mid h_{n-1})\int_{E_X}\int_A Q_X(dx_n\mid x_{n-1},y_{n-1},a_{n-1})\int_{\mathbb{R}} I_{\{\lambda_{n-1}-r(x_{n-1},y_{n-1},a_{n-1})\}}(d\lambda_n)\\
    &\qquad\times\int_{E_Y}V(h_{n-1},a_{n-1},x_n,\lambda_n,y_n)\,\mu_n(dy_n\mid h_{n-1},a_{n-1},x_n,\lambda_n)\,\pi_{n-1}(da_{n-1}\mid h_{n-1})\\
    &=\int_{E_Y}\mu_{n-1}(dy_{n-1}\mid h_{n-1})\int_{E_X}\int_{E_Y}\int_A \rho(dx_n)\,\nu(dy)\,q(x_n,y\mid x_{n-1},y_{n-1},a_{n-1})\int_{\mathbb{R}} I_{\{\lambda_{n-1}-r(x_{n-1},y_{n-1},a_{n-1})\}}(d\lambda_n)\\
    &\qquad\times\int_{E_Y}V(h_{n-1},a_{n-1},x_n,\lambda_n,y_n)\,\Phi(x_{n-1},\lambda_{n-1},a_{n-1},x_n,\lambda_n,\mu_{n-1})(dy_n)\,\pi_{n-1}(da_{n-1}\mid h_{n-1})\\
    &=\int_{E_Y}\mu_{n-1}(dy_{n-1}\mid h_{n-1})\int_{E_X}\int_{E_Y}\int_A q(x_n,y_n\mid x_{n-1},y_{n-1},a_{n-1})\,\rho(dx_n)\,\nu(dy_n)\\
    &\qquad\times V(h_{n-1},a_{n-1},x_n,\lambda_{n-1}-r(x_{n-1},y_{n-1},a_{n-1}),y_n)\,\pi_{n-1}(da_{n-1}\mid h_{n-1}).
    \end{aligned}\tag{3.6}
    $$

    On the other hand, by induction, we have the following:

    $$
    \begin{aligned}
    &E^\pi_{(x,\lambda)}[V(h_{n-1},A_{n-1},X_n,\Lambda_n,Y_n)]\\
    &=\int_{E_Y}\mu_{n-1}(dy_{n-1}\mid h_{n-1})\int_{E_X}\int_{E_Y}\int_A Q(d(x_n,y_n)\mid x_{n-1},y_{n-1},a_{n-1})\\
    &\qquad\times\int_{\mathbb{R}} I_{\{\lambda_{n-1}-r(x_{n-1},y_{n-1},a_{n-1})\}}(d\lambda_n)\,V(h_{n-1},a_{n-1},x_n,\lambda_n,y_n)\,\pi_{n-1}(da_{n-1}\mid h_{n-1})\\
    &=\int_{E_Y}\mu_{n-1}(dy_{n-1}\mid h_{n-1})\int_{E_X}\int_{E_Y}\int_A q(x_n,y_n\mid x_{n-1},y_{n-1},a_{n-1})\,\rho(dx_n)\,\nu(dy_n)\\
    &\qquad\times\int_{\mathbb{R}} I_{\{\lambda_{n-1}-r(x_{n-1},y_{n-1},a_{n-1})\}}(d\lambda_n)\,V(h_{n-1},a_{n-1},x_n,\lambda_n,y_n)\,\pi_{n-1}(da_{n-1}\mid h_{n-1})\\
    &=\int_{E_Y}\mu_{n-1}(dy_{n-1}\mid h_{n-1})\int_{E_X}\int_{E_Y}\int_A q(x_n,y_n\mid x_{n-1},y_{n-1},a_{n-1})\,\rho(dx_n)\,\nu(dy_n)\\
    &\qquad\times V(h_{n-1},a_{n-1},x_n,\lambda_{n-1}-r(x_{n-1},y_{n-1},a_{n-1}),y_n)\,\pi_{n-1}(da_{n-1}\mid h_{n-1}),
    \end{aligned}\tag{3.7}
    $$

    which, together with Eq (3.6), implies that (3.5) is satisfied. In particular, taking $V(X_0,\Lambda_0,A_0,\ldots,X_n,\Lambda_n,Y_n)=I_{B\times C}(Y_n,(X_0,\Lambda_0,A_0,\ldots,X_n,\Lambda_n))$, we obtain the following:

    $$P^\pi_{(x,\lambda)}\big(Y_n\in B,(X_0,\Lambda_0,A_0,\ldots,X_n,\Lambda_n)\in C\big)=E^\pi_{(x,\lambda)}\big[\mu_n(B\mid X_0,\Lambda_0,A_0,\ldots,X_n,\Lambda_n)\,I_C(X_0,\Lambda_0,A_0,\ldots,X_n,\Lambda_n)\big],$$

    which implies that the Lemma holds.

    The partially observable risk probability MDP can be transformed into a filtered risk probability MDP by enlarging the state space and modifying the transition kernel and the reward function.

    Definition 3.1. The filtered model of the POMDP consists of the elements $\{E,A,\hat Q,\hat r\}$, which have the following meanings:

    $E:=E_X\times\mathcal{P}(E_Y)$ denotes the state space, with elements written $(x,\mu)\in E$, where $x$ denotes the observable state and $\mu$ denotes the conditional probability distribution of the unobservable state.

    $A$ denotes the action space. $A(x,\mu):=A(x)\subset A$ denotes the class of admissible actions in the state $(x,\mu)\in E$.

    $\hat r$ denotes a nonnegative real-valued measurable reward function, defined for $(x,\mu,a)$ with $(x,a)\in K$ and $\mu\in\mathcal{P}(E_Y)$, satisfying the following:

    $$\hat r(x,\mu,a)=\int r(x,y,a)\,\mu(dy).$$

    $\hat Q$ denotes the transition law from $E\times\mathbb{R}\times A$ to $E\times\mathbb{R}$, which is expressed as follows:

    $$\hat Q(B\times C\times D\mid x,\lambda,\mu,a):=\int_B\int_C I_D\big(\Phi(x,\lambda,a,\hat x,\hat\lambda,\mu)\big)\, I_{\{\lambda-\hat r(x,\mu,a)\}}(d\hat\lambda)\, Q_X(d\hat x\mid x,\mu,a),$$

    where $Q_X(B\mid x,\mu,a):=\int_{E_Y} Q_X(B\mid x,y,a)\,\mu(dy)$ for all $(x,\mu)\in E$, $\lambda\in\mathbb{R}$, $a\in A(x)$, $B\in\mathcal{B}(E_X)$, $D\in\mathcal{B}(\mathcal{P}(E_Y))$, $C\in\mathcal{B}(\mathbb{R})$.
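    For finite $E_Y$, the filtered reward $\hat r$ is simply the belief-weighted average of $r$; a minimal sketch (with `reward` and `mu` as hypothetical dictionaries in the same spirit as the earlier sketches):

```python
def filtered_reward(x, mu, a, reward):
    """hat_r(x, mu, a) = sum over y of r(x, y, a) * mu(y), for a finite unobservable space.

    mu is a dict {y: prob}; reward[(x, y, a)] is the reward r(x, y, a).
    """
    return sum(reward[(x, y, a)] * p for y, p in mu.items())
```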

    To formulate the optimal control problem rigorously, some notation and the definitions of policies for the filtered MDP are introduced. $\Phi$ stands for the class of stochastic kernels $\varphi$ on $A$ given $E\times\mathbb{R}$ with the property $\varphi(A(x)\mid x,\lambda,\mu)=1$ for all $(x,\lambda,\mu)\in E\times\mathbb{R}$. $F$ stands for the class of all measurable mappings $f$ from $E\times\mathbb{R}$ to $A$ with $f(x,\lambda,\mu)\in A(x)$ for all $(x,\lambda,\mu)\in E\times\mathbb{R}$.

    Definition 3.2. A randomized Markov policy is a sequence $\pi^M=\{\hat\varphi_k,\ k\ge 0\}$ of stochastic kernels $\hat\varphi_k\in\Phi$ satisfying $\hat\varphi_k(A(x_k)\mid x_k,\lambda_k,\mu_k)=1$ for each $\mu_k\in\mathcal{P}(E_Y)$, $k\ge 0$. Such a policy is written $\pi^M=\{\hat\varphi_k\}$.

    A randomized Markov policy $\pi^M=\{\hat\varphi_k\}$ is called deterministic Markov if there exists a sequence of functions $\{f_k,\ k\ge 0\}\subset F$ such that $\hat\varphi_k(\cdot\mid x_k,\lambda_k,\mu_k)$ is concentrated at $f_k(x_k,\lambda_k,\mu_k)$ for each $k\ge 0$.

    The classes of all randomized and deterministic Markov policies are denoted by $\Pi^{RM}$ and $\Pi^{DM}$, respectively. By the above definition, such Markov policies depend on the historical information $h_k$, $k\ge 0$, through the filter $\mu_k(\cdot\mid h_k)$. Then, for any $\pi^M=\{\hat\varphi_0,\hat\varphi_1,\ldots\}\in\Pi^{RM}$, we can find a policy $\pi=\{\pi_0,\pi_1,\ldots\}\in\Pi$ that satisfies the following:

    $$\pi_0(da_0\mid x_0,y_0,\lambda_0):=\hat\varphi_0(da_0\mid x_0,\lambda_0,\mu_0), \tag{3.8}$$
    $$\pi_k(da_k\mid h_k):=\hat\varphi_k(da_k\mid x_k,\lambda_k,\mu_k(\cdot\mid h_k)), \tag{3.9}$$

    for $k\ge 0$ and $h_k\in H_k$. Thus, $\Pi^{DM}\subset\Pi^{RM}\subset\Pi$.

    Based on the transition probability $\hat Q$ and the initial distribution $\mu_0$, for any $(x,\lambda,\mu)\in E\times\mathbb{R}$ and $\pi\in\Pi$, according to the Ionescu Tulcea theorem (e.g., Proposition 7.45 in [25]), the probability measure $\hat P^\pi_{(x,\lambda,\mu)}$ can be constructed on $(\Omega,\mathcal{F})$ as follows:

    $$\hat P^\pi_{(x,\lambda,\mu)}(X_0=x,\Lambda_0=\lambda)=1, \tag{3.10}$$
    $$\hat P^\pi_{(x,\lambda,\mu)}(A_k\in G\mid h_k)=\pi_k(G\mid h_k), \tag{3.11}$$
    $$\hat P^\pi_{(x,\lambda,\mu)}(X_{k+1}\in B,\,\Lambda_{k+1}\in C,\,\mu_{k+1}\in D\mid h_k,\mu_k)=\int_{B\times C\times D}\int_A \hat Q(dx_{k+1},d\lambda_{k+1},d\mu_{k+1}\mid x_k,\lambda_k,\mu_k,a_k)\,\pi_k(da_k\mid h_k), \tag{3.12}$$

    for all $h_k\in H_k$, $a\in A(x_k)$, $B\in\mathcal{B}(E_X)$, $C\in\mathcal{B}(\mathbb{R})$, $D\in\mathcal{B}(\mathcal{P}(E_Y))$, $G\in\mathcal{B}(A)$. The expectation operator corresponding to the probability measure $\hat P^\pi_{(x,\lambda,\mu)}$ is denoted by $\hat E^\pi_{(x,\lambda,\mu)}$.

    For any $(x,\lambda,\mu)\in E\times\mathbb{R}$ and $\pi\in\Pi$, the value function of the filtered MDP is given by the following:

    $$U^\pi_N(x,\lambda,\mu):=\hat P^\pi_{(x,\lambda,\mu)}\Big(\sum_{n=0}^{N}\hat r(X_n,\mu_n,A_n)\le\lambda\Big), \tag{3.13}$$
    $$U_N(x,\lambda,\mu):=\inf_{\pi\in\Pi}U^\pi_N(x,\lambda,\mu). \tag{3.14}$$

    Notation: For any policy $\pi\in\Pi$ and $(x,\lambda,\mu)\in E\times\mathbb{R}$, the risk probability of the total rewards $U^\pi_n$ is defined as follows:

    $$U^\pi_n(x,\lambda,\mu):=\hat P^\pi_{(x,\lambda,\mu)}\Big(\sum_{k=0}^{n}\hat r(X_k,\mu_k,A_k)\le\lambda\Big),$$

    with $n=0,1,\ldots,N$.

    Moreover, the minimal risk probability of the filtered MDP model is defined by the following:

    $$U_n(x,\lambda,\mu):=\inf_{\pi\in\Pi}U^\pi_n(x,\lambda,\mu).$$

    Let $\mathcal{U}$ be the class of mappings $U: E\times\mathbb{R}\to[0,1]$. For any $(x,\lambda,\mu)\in E\times\mathbb{R}$, $U\in\mathcal{U}$, $\varphi\in\Phi$ and $a\in A(x)$, the operators $T_aU$, $T_\varphi U$ and $TU$ are defined as follows:

    $$T_aU(x,\lambda,\mu):=\int_{E_X}U\big(\hat x,\lambda-\hat r(x,\mu,a),\Phi(x,\lambda,a,\hat x,\lambda-\hat r(x,\mu,a),\mu)\big)\,Q_X(d\hat x\mid x,\mu,a),\qquad T_\varphi U(x,\lambda,\mu):=\int_A T_aU(x,\lambda,\mu)\,\varphi(da\mid x,\lambda,\mu), \tag{3.15}$$
    $$TU(x,\lambda,\mu):=\min_{a\in A(x)}T_aU(x,\lambda,\mu). \tag{3.16}$$

    Based on the conditional distribution of the unobservable state given by the filter equations, the following equivalence between the partially observable model and the filtered model is established.

    Lemma 3.2. Under Assumption 3.1, for each $\pi\in\Pi$, $x\in E_X$ and $\lambda\in\mathbb{R}$, we have $F^\pi_N(x,\lambda)=U^\pi_N(x,\lambda,Q_0)$ and $F_N(x,\lambda)=U_N(x,\lambda,Q_0)$.

    Proof. For each $\pi\in\Pi$ and $x\in E_X$, the following result is first proven by induction:

    $$F^\pi_n(x,\Lambda)=U^\pi_n(x,\hat\Lambda,\mu), \tag{3.17}$$

    with $n=-1,0,1,2,\ldots$, for the reward goal function $\Lambda: K\times E_Y\to\mathbb{R}_+$ and $\hat\Lambda=\int\Lambda(x,y,a)\,\mu(dy)$.

    Since $F^\pi_{-1}=U^\pi_{-1}=I_{[0,+\infty)}(\lambda)$ for any $\pi\in\Pi$ and $x\in E_X$, $\lambda\in\mathbb{R}$, Eq (3.17) is valid when $n=-1$. Suppose that (3.17) holds for $k=n$; then, by (2.6),

    $$
    \begin{aligned}
    U^\pi_{n+1}(x,\lambda,\mu_0)&=\hat P^\pi_{(x,\lambda,\mu_0)}\Big(\sum_{k=0}^{n+1}\hat r(X_k,\mu_k,A_k)\le\lambda\Big)\\
    &=\hat E^\pi_{(x,\lambda,\mu_0)}\Big[\hat E^\pi_{(x,\lambda,\mu_0)}\big[I_{\{\sum_{k=0}^{n+1}\hat r(X_k,\mu_k,A_k)\le\lambda\}}\,\big|\,X_0,\Lambda_0,\mu_0,A_0,X_1,\Lambda_1,\mu_1\big]\Big]\\
    &=\int_{E_X}\int_{\mathbb{R}}\int_{\mathcal{P}(E_Y)}\int_A \hat E^\pi_{(x,\lambda,\mu_0)}\big[I_{\{\sum_{k=0}^{n+1}\hat r(X_k,\mu_k,A_k)\le\lambda\}}\,\big|\,X_0=x,\Lambda_0=\lambda,\mu_0=Q_0,A_0=a_0,X_1=x_1,\\
    &\qquad \Lambda_1=\lambda-\hat r(x,Q_0,a_0),\mu_1=\Phi(x,\lambda,a_0,x_1,\lambda_1,Q_0)\big]\,\hat Q(dx_1,d\lambda_1,d\mu_1\mid x,\lambda,Q_0,a_0)\,\pi_0(da_0\mid x,\lambda)\\
    &=\int_{E_X}\int_A \hat P^{1\pi}_{(x_1,\lambda-\hat r(x,Q_0,a_0),\Phi(x,\lambda,a_0,x_1,\lambda_1,Q_0))}\Big(\sum_{k=0}^{n}\hat r(X_k,\mu_k,A_k)\le\lambda_1\Big)\,Q_X(dx_1\mid x,Q_0,a_0)\,\pi_0(da_0\mid x,\lambda)\\
    &=\int_{E_X}\int_A U^{1\pi}_n\big(x_1,\lambda-\hat r(x,Q_0,a_0),\Phi(x,\lambda,a_0,x_1,\lambda_1,Q_0)\big)\,Q_X(dx_1\mid x,Q_0,a_0)\,\pi_0(da_0\mid x,\lambda)
    \end{aligned}\tag{3.18}
    $$

    is obtained, where $1\pi:=\{\pi_1,\pi_2,\ldots\}$ represents the 1-shift policy of $\pi$.

    On the other hand, by Eq (3.4), we have the following:

    $$
    \begin{aligned}
    F^\pi_{n+1}(x,\lambda)&=\int_{E_Y}P^\pi_{(x,y,\lambda)}\Big(\sum_{k=0}^{n+1}r(X_k,Y_k,A_k)\le\lambda\Big)Q_0(dy)\\
    &=\int_{E_Y}E^\pi_{(x,y,\lambda)}\Big[E^\pi_{(x,y,\lambda)}\big[I_{\{\sum_{k=0}^{n+1}r(X_k,Y_k,A_k)\le\lambda\}}\,\big|\,X_0,\Lambda_0,Y_0,A_0,X_1,\Lambda_1,Y_1\big]\Big]Q_0(dy)\\
    &=\int_{E_Y}\int_{E_X}\int_{\mathbb{R}}\int_A\int_{E_Y}E^\pi_{(x,y,\lambda)}\big[I_{\{\sum_{k=0}^{n+1}r(X_k,Y_k,A_k)\le\lambda\}}\,\big|\,X_0=x,\Lambda_0=\lambda,Y_0=y,A_0=a_0,X_1=x_1,\Lambda_1=\lambda_1,Y_1=y_1\big]\\
    &\qquad\times\Phi(x,\lambda,a_0,x_1,\lambda_1,Q_0)(dy_1)\,Q_X(dx_1\mid x,y,a_0)\,I_{\{\lambda-r(x,y,a_0)\}}(d\lambda_1)\,\pi_0(da_0\mid x,\lambda)\,Q_0(dy)\\
    &=\int_{E_Y}\int_{E_X}\int_{\mathbb{R}}\int_A\int_{E_Y}P^{1\pi}_{(x_1,y_1,\lambda-r(x,y,a_0))}\Big(\sum_{k=0}^{n}r(X_k,Y_k,A_k)\le\lambda-r(x,y,a_0)\Big)\Phi(x,\lambda,a_0,x_1,\lambda_1,Q_0)(dy_1)\\
    &\qquad\times Q_X(dx_1\mid x,y,a_0)\,I_{\{\lambda-r(x,y,a_0)\}}(d\lambda_1)\,\pi_0(da_0\mid x,\lambda)\,Q_0(dy)\\
    &=\int_{E_X}\int_A\int_{E_Y}F^{1\pi}_n\big(x_1,\lambda-r(x,y,a_0)\big)\,Q_X(dx_1\mid x,y,a_0)\,\pi_0(da_0\mid x,\lambda)\,Q_0(dy),
    \end{aligned}
    $$

    which, together with Eq (3.18) and the induction hypothesis, proves Eq (3.17) for $n=0,1,\ldots,N$, i.e., $F^\pi_n(x,\lambda)=U^\pi_n(x,\lambda,Q_0)$.

    In particular, for $n=N$ we have $F^\pi_N(x,\lambda)=U^\pi_N(x,\lambda,Q_0)$; since the policy $\pi$ is arbitrary, taking the infimum over $\pi\in\Pi$ yields $F_N(x,\lambda)=U_N(x,\lambda,Q_0)$.

    The establishment of the optimality equation requires the following theorem.

    Theorem 3.1. Suppose that Assumption 3.1 is satisfied. Then, for any $(x,\lambda,\mu)\in E\times\mathbb{R}$, $\pi=\{\pi_0,\pi_1,\ldots\}\in\Pi$ and $n\ge 0$, the following statement holds: $U^\pi_{n+1}(x,\lambda,\mu)=T_{\pi_0}U^{1\pi}_n(x,\lambda,\mu)$, where $U^\pi_0(x,\lambda,\mu)=I_{[0,+\infty)}(\lambda)$ and $1\pi:=\{\pi_1,\pi_2,\ldots\}$ represents the 1-shift policy of $\pi$.

    Proof. For any $(x,\lambda,\mu)\in E\times\mathbb{R}$, $\pi=\{\pi_0,\pi_1,\ldots\}\in\Pi$ and $n=0,1,\ldots,N-1$, by (3.8) and the properties of conditional expectation, we obtain the following:

    $$
    \begin{aligned}
    U^\pi_{n+1}(x,\lambda,\mu)&=\hat P^\pi_{(x,\lambda,\mu)}\Big(\sum_{k=0}^{n+1}\hat r(X_k,\mu_k,A_k)\le\lambda\Big)\\
    &=\hat E^\pi_{(x,\lambda,\mu)}\Big[\hat E^\pi_{(x,\lambda,\mu)}\big[I_{\{\sum_{k=0}^{n+1}\hat r(X_k,\mu_k,A_k)\le\lambda\}}\,\big|\,X_0,\Lambda_0,\mu_0,A_0,X_1,\Lambda_1,\mu_1\big]\Big]\\
    &=\int_A\int_{E_X}\int_{\mathbb{R}}\int_{\mathcal{P}(E_Y)}\hat P^\pi_{(x,\lambda,\mu)}\Big(\sum_{k=0}^{n+1}\hat r(X_k,\mu_k,A_k)\le\lambda\,\Big|\,X_0=x,\Lambda_0=\lambda,\mu_0=\mu,A_0=a,X_1=\hat x,\Lambda_1=\hat\lambda,\mu_1=\hat\mu\Big)\\
    &\qquad\times\hat Q(d\hat x,d\hat\lambda,d\hat\mu\mid x,\lambda,\mu,a)\,\pi_0(da\mid x,\lambda)\\
    &=\int_{E_X}\int_{\mathbb{R}}\int_A \hat P^\pi_{(x,\lambda,\mu)}\Big(\sum_{k=0}^{n+1}\hat r(X_k,\mu_k,A_k)\le\lambda\,\Big|\,X_0=x,\Lambda_0=\lambda,\mu_0=\mu,A_0=a,X_1=\hat x,\Lambda_1=\hat\lambda,\mu_1=\hat\mu\Big)\\
    &\qquad\times Q_X(d\hat x\mid x,\mu,a)\,I_{\{\lambda-\hat r(x,\mu,a)\}}(d\hat\lambda)\,\pi_0(da\mid x,\lambda)\\
    &=\int_{E_X}\int_A \hat P^{1\pi}_{(\hat x,\lambda-\hat r(x,\mu,a),\Phi(x,\lambda,a,\hat x,\lambda-\hat r(x,\mu,a),\mu))}\Big(\sum_{k=0}^{n}\hat r(X_k,\mu_k,A_k)\le\lambda-\hat r(x,\mu,a)\Big)\,Q_X(d\hat x\mid x,\mu,a)\,\pi_0(da\mid x,\lambda)\\
    &=\int_{E_X}\int_A U^{1\pi}_n\big(\hat x,\lambda-\hat r(x,\mu,a),\Phi(x,\lambda,a,\hat x,\lambda-\hat r(x,\mu,a),\mu)\big)\,Q_X(d\hat x\mid x,\mu,a)\,\pi_0(da\mid x,\lambda)\\
    &=:T_{\pi_0}U^{1\pi}_n(x,\lambda,\mu).
    \end{aligned}
    $$

    This completes the proof.

    Theorem 3.2. Suppose that Assumption 3.1 holds. Then, for each $(x,\lambda,\mu)\in E\times\mathbb{R}$:

    (a) $\{U_n,\ n=0,1,\ldots,N-1\}$ satisfies the corresponding optimality equation:

    $$U_0(x,\lambda,\mu):=I_{[0,+\infty)}(\lambda),\qquad U_{n+1}(x,\lambda,\mu)=TU_n(x,\lambda,\mu).$$

    (b) There exists a policy $g_n\in\Pi^{DM}$ such that $U_{n+1}=T_{g_n}U_n$ for $n=0,1,\ldots,N-1$. Then, the policy $\pi^*:=\{f_0,f_1,\ldots,f_{N-1}\}\in\Pi^{DH}$ is optimal, where $f_n(h_n):=g_{N-1-n}(x_n,\lambda_n,\mu_n)$ for each $h_n\in H_n$, $n=0,1,\ldots,N-1$.

    Proof. (a) According to Theorem 3.1 and (3.16), for each $(x,\lambda,\mu)\in E\times\mathbb{R}$ and $\pi=\{\pi_0,\pi_1,\ldots\}\in\Pi$, we have the following:

    $$U^\pi_{n+1}(x,\lambda,\mu)=T_{\pi_0}U^{1\pi}_n(x,\lambda,\mu)\ge T_{\pi_0}U_n(x,\lambda,\mu)\ge TU_n(x,\lambda,\mu). \tag{3.19}$$

    Since $\pi$ is arbitrary, this implies $U_{n+1}(x,\lambda,\mu)\ge TU_n(x,\lambda,\mu)$.

    To prove the reverse inequality, the following fact is needed: for any $(x,\lambda,\mu)\in E\times\mathbb{R}$ and $n\ge -1$, there is a policy $\eta\in\Pi^{DM}$ which satisfies $U_n(x,\lambda,\mu)=U^\eta_n(x,\lambda,\mu)$. Since $U_{-1}(x,\lambda,\mu)=U^\pi_{-1}(x,\lambda,\mu)=I_{[0,+\infty)}(\lambda)$ for any $\pi\in\Pi^{RM}$, this fact trivially holds for $n=-1$. Assume that there exists a policy $\zeta\in\Pi^{DM}$ that satisfies $U_k(x,\lambda,\mu)=U^\zeta_k(x,\lambda,\mu)$ for $n=k\ge -1$. On the other hand, since the set of actions is finite, there exists an $f\in F$ that satisfies $T_fU_k(x,\lambda,\mu)=TU_k(x,\lambda,\mu)$. Then, letting $\theta=\{f,\zeta\}\in\Pi^{DM}$, we know that

    $$U_{k+1}(x,\lambda,\mu)\le U^\theta_{k+1}(x,\lambda,\mu)=T_fU^\zeta_k(x,\lambda,\mu)=T_fU_k(x,\lambda,\mu)=TU_k(x,\lambda,\mu), \tag{3.20}$$

    where the first equality follows from Theorem 3.1 and the second equality follows from the induction hypothesis. Combining (3.19) and (3.20), we obtain $TU_k=U_{k+1}$ and $U_{k+1}=U^\theta_{k+1}$ with $\theta\in\Pi^{DM}$, which completes the induction and proves the assertion.

    (b) For each $(x,\lambda,\mu)\in E\times\mathbb{R}$, the existence of a policy $g_n\in\Pi^{DM}$ satisfying $U_{n+1}=T_{g_n}U_n$ for $n=0,1,\ldots,N-1$ follows from the finiteness of the action set. Let $\pi=\pi(n):=\{g_n,g_{n-1},\ldots,g_0\}\in\Pi$. When $n=0$, by Theorem 3.1, $U^\pi_1=T_{g_0}U^{1\pi}_0=T_{g_0}U_0=TU_0=U_1$. Assuming that $U_k=U^{1\pi}_k$ for $n=k$, by Theorem 3.1 and part (a),

    $$U^\pi_{k+1}=T_{g_k}U^{1\pi}_k=T_{g_k}U_k=TU_k=U_{k+1}.$$

    Thus, the induction is complete and $U^{\pi^*}_N=U_N$ for $\pi^*:=\{f_0,f_1,\ldots,f_{N-1}\}$ with $f_n(h_n):=g_{N-1-n}(x_n,\lambda_n,\mu_n(\cdot\mid h_n))$, $n=0,1,\ldots,N-1$. Hence, the policy $\pi^*$ is optimal.

    Based on Theorem 3.2, the value iteration algorithm is established as follows:

    The value iteration algorithm

    Step 1. Let $U_0(x,\lambda,\mu):=I_{[0,+\infty)}(\lambda)$ for $(x,\lambda,\mu)\in E\times\mathbb{R}$.

    Step 2. For $n=0,1,\ldots,N-1$, compute the value function $U_{n+1}$ as follows:

    $$T_aU_n(x,\lambda,\mu)=\int_{E_X}U_n\big(\hat x,\lambda-\hat r(x,\mu,a),\Phi(x,\lambda,a,\hat x,\lambda-\hat r(x,\mu,a),\mu)\big)\,Q_X(d\hat x\mid x,\mu,a),$$
    $$U_{n+1}(x,\lambda,\mu)=\min_{a\in A(x)}\{T_aU_n(x,\lambda,\mu)\}.$$

    Step 3. Find a decision rule $g_{N-1}$ satisfying $U_N=T_{g_{N-1}}U_{N-1}$. Then, by Theorem 3.2, the policy $\pi^*$ constructed from $\{g_0,g_1,\ldots,g_{N-1}\}$ is optimal.
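    The following Python sketch mirrors Steps 1–3 for a finite model. Instead of discretizing the whole space $E\times\mathbb{R}$, it recurses over the triples $(x,\lambda,\mu)$ that are actually reachable and caches them; `QX`, `q`, `reward` and `actions` are hypothetical placeholders in the same layout as the earlier sketches, and the belief update again omits the reward-consistency indicator of (3.1) for simplicity:

```python
from functools import lru_cache

def make_value_iteration(EX, EY, actions, QX, q, reward):
    """Finite-model sketch of the value iteration algorithm (Steps 1-3).

    Hypothetical data layout (placeholders, not from the paper):
      actions[x]                   -> iterable of admissible actions A(x)
      QX[(x_next, x, y, a)]        -> marginal transition prob Q_X(x' | x, y, a)
      q[(x_next, y_next, x, y, a)] -> joint transition prob q(x', y' | x, y, a)
      reward[(x, y, a)]            -> reward r(x, y, a)
    Beliefs mu are tuples ordered like EY so they can be cached.
    """
    def r_hat(x, mu, a):              # filtered reward hat_r(x, mu, a)
        return sum(reward[(x, y, a)] * p for y, p in zip(EY, mu))

    def qx_mixed(x2, x, mu, a):       # Q_X(x' | x, mu, a)
        return sum(QX[(x2, x, y, a)] * p for y, p in zip(EY, mu))

    def belief_update(x, a, x2, mu):  # simplified filter step (no reward indicator)
        w = [sum(q[(x2, y2, x, y, a)] * p for y, p in zip(EY, mu)) for y2 in EY]
        s = sum(w)
        return tuple(v / s for v in w) if s > 0 else mu

    @lru_cache(maxsize=None)
    def U(n, x, lam, mu):
        if n == 0:                    # Step 1: U_0 = I_{[0, +infinity)}(lambda)
            return 1.0 if lam >= 0 else 0.0
        best = None                   # Step 2: U_n = min_a T_a U_{n-1}
        for a in actions[x]:
            lam2 = lam - r_hat(x, mu, a)
            val = sum(
                qx_mixed(x2, x, mu, a) * U(n - 1, x2, lam2, belief_update(x, a, x2, mu))
                for x2 in EX
            )
            if best is None or val < best:
                best = val
        return best

    def greedy_action(n, x, lam, mu):  # Step 3: an action attaining the minimum at stage n
        return min(
            actions[x],
            key=lambda a: sum(
                qx_mixed(x2, x, mu, a)
                * U(n - 1, x2, lam - r_hat(x, mu, a), belief_update(x, a, x2, mu))
                for x2 in EX
            ),
        )

    return U, greedy_action
```

    Here `U(N, x, lam, mu)` plays the role of $U_N(x,\lambda,\mu)$ and the greedy minimizers play the role of the decision rules $g_n$; since the belief update deliberately drops the reward-consistency indicator of (3.1), it is only a simplified stand-in for the exact filter.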

    An illustrative example is provided to show how the value function and an optimal policy are calculated and to demonstrate the effectiveness and feasibility of the value iteration algorithm.

    Example 4.1. Consider a machine production process with two observable product quality states (i.e., nonconforming product 0 and qualified product 1) and two unobservable machine operation states (i.e., poor state 1 and good state 2). At the initial time $n=0$, given the observed product quality $x=1$ and the reward goal $\lambda$, when the production process is in the unobservable state $y\in\{1,2\}$, the decision maker can select either an ordinary maintenance action $a_{11}$ or an advanced maintenance action $a_{12}$, receiving a reward $r(x,y,a)$. If the product quality is $x=0$, the decision maker must select the advanced maintenance action $a_{01}$. When the maintenance action $a$ is applied, the system moves to the state $(x',y')$ at the next moment with probability $Q(\cdot,\cdot\mid x,y,a)$. The general objective of the decision maker is to select actions that minimize the probability that the total rewards over the horizon from $0$ to $N=15$ do not exceed the target $\lambda$.

    This evolution process can be formulated as a discrete-time POMDP with state space $E_X\times E_Y=\{0,1\}\times\{1,2\}$ and admissible action sets $A(0)=\{a_{01}\}$, $A(1)=\{a_{11},a_{12}\}$. Assume that the transition probabilities are given by $Q(\cdot,\cdot\mid x,y,a)=Q_X(\cdot\mid x,y,a)\,p(\cdot\mid y)$, where the probabilities $Q_X(\cdot\mid x,y,a)$ are given by the following:

    $$
    \begin{aligned}
    &Q_X(0\mid 0,1,a_{01})=1, && Q_X(1\mid 0,1,a_{01})=0, && Q_X(0\mid 0,2,a_{01})=1, && Q_X(1\mid 0,2,a_{01})=0,\\
    &Q_X(0\mid 1,1,a_{11})=0.5, && Q_X(1\mid 1,1,a_{11})=0.5, && Q_X(0\mid 1,1,a_{12})=0.3, && Q_X(1\mid 1,1,a_{12})=0.7,\\
    &Q_X(0\mid 1,2,a_{11})=0.4, && Q_X(1\mid 1,2,a_{11})=0.6, && Q_X(0\mid 1,2,a_{12})=0.2, && Q_X(1\mid 1,2,a_{12})=0.8.
    \end{aligned}\tag{4.1}
    $$

    The transition probabilities of the unobserved state are given by $p(2\mid 2)=1-p(1\mid 2)=0.7$ and $p(1\mid 1)=1$. The reward rates are given as follows:

    $$r(0,1,a_{01})=r(0,2,a_{01})=0,\quad r(1,1,a_{11})=2,\quad r(1,1,a_{12})=4,\quad r(1,2,a_{11})=1,\quad r(1,2,a_{12})=3.$$

    Our main goal is to use the value iteration algorithm to compute the value function and the optimal policies.
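    As a small worked instance of the filtered reward $\hat r$ for this example (using the reward rates above and an illustrative belief $\mu=(\mu(1),\mu(2))$ about the machine state, which is not taken from the paper's computations):

    $$\hat r(1,\mu,a_{11})=2\,\mu(1)+1\,\mu(2),\qquad \hat r(1,\mu,a_{12})=4\,\mu(1)+3\,\mu(2),\qquad \hat r(0,\mu,a_{01})=0.$$

    For instance, with the illustrative belief $\mu=(0.5,0.5)$ these values are $1.5$ and $3.5$, so in the filtered model one step of ordinary or advanced maintenance lowers the remaining goal $\lambda$ by $1.5$ or $3.5$, respectively.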

    First, according to (3.13), since $r(0,1,a_{01})=r(0,2,a_{01})=0$, it is known that $U_N(0,\lambda,\mu)=I_{[0,+\infty)}(\lambda)$. Based on the value iteration algorithm above and Matlab software, the curves of the functions $T_{a_{11}}U_N(1,\lambda,\mu)$, $T_{a_{12}}U_N(1,\lambda,\mu)$ and the approximated value function $U_N(1,\lambda,\mu)$ are plotted (see Figures 1 and 2). By observing the figures, the following conclusions are obtained:

    Figure 1.  The function $T_aU_N(1,\lambda,\mu)$ for $a\in\{a_{11},a_{12}\}$.
    Figure 2.  The value function $U_N(1,\lambda,\mu)$.

    (a) As seen in Figure 1, when $x=1$ and $\lambda\in(0,4)$, the value $T_{a_{12}}U_N(1,\lambda,\mu)$ is less than $T_{a_{11}}U_N(1,\lambda,\mu)$; if instead $\lambda\in[4,+\infty)$, the value $T_{a_{11}}U_N(1,\lambda,\mu)$ is less than $T_{a_{12}}U_N(1,\lambda,\mu)$. Hence, when the observable state is $x=1$ and $\lambda\in(0,4)$, the decision maker should choose the lower-risk action $a_{12}$; conversely, if $\lambda\in[4,+\infty)$, the decision maker should choose the lower-risk action $a_{11}$ instead of $a_{12}$.

    (b) Based on Figure 1, the risk probability optimal policy for the POMDP at times $n=0,1,\ldots,N$ is given by the following:

    $$f(1,\lambda)=\begin{cases}a_{12}, & 0\le\lambda<4;\\ a_{11}, & \lambda\ge 4.\end{cases}\tag{4.2}$$

    In this paper, we studied the problem of minimizing the risk probability criterion for finite horizon partially observable discrete-time Markov decision processes (POMDPs). Unlike the classical expectation criterion, the reward levels are regarded as a component of an extended state; accordingly, we redefined history-dependent policies and reconstructed a new probability measure. Based on the Bayes operator and the filter equations we constructed, the risk probability optimization problem was equivalently reformulated as a filtered Markov decision process. We proposed a value iteration algorithm, established the existence of a solution to the optimality equation, and proved the existence of a risk probability optimal policy.

    The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.

    This work was supported by Guangxi science and technology base and talent project (Grant No. AD21159005); Foundation of Guangxi Educational Committee (Grant No. KY2022KY0342); National Natural Science Foundation of China (Grant No. 11961005, 12361091); Guangxi Natural Science Foundation Program (Grant No. 2020GXNSFAA297196); The Doctoral Foundation of Guangxi University of Science and Technology (Grant No. 18Z06).

    The authors declare that there is no conflict of interests regarding the publication of this paper.



    [1] N. Bäuerle, U. Rieder, Markov decision processes with applications to finance, Heidelberg: Springer, 2011. https://doi.org/10.1007/978-3-642-18324-9
    [2] J. Janssen, R. Manca, Semi-Markov risk models for finance, insurance and reliability, New York: Springer, 2006. https://doi.org/10.1007/0-387-70730-1
    [3] X. P. Guo, O. Hernández-Lerma, Continuous-time Markov decision processes: Theory and applications, Berlin: Springer-Verlag, 2009. https://doi.org/10.1007/978-3-642-02547-1
    [4] M. J. Sobel, The variance of discounted Markov decision processes, J. Appl. Probab., 19 (1982), 794–802. https://doi.org/10.1017/s0021900200023123 doi: 10.1017/s0021900200023123
    [5] Y. Ohtsubo, K. Toyonaga, Optimal policy for minimizing risk models in Markov decision processes, J. Math. Anal. Appl., 271 (2002), 66–81. https://doi.org/10.1016/s0022-247x(02)00097-5 doi: 10.1016/s0022-247x(02)00097-5
    [6] D. J. White, Minimizing a threshold probability in discounted Markov decision processes, J. Math. Anal. Appl., 173 (1993), 634–646. https://doi.org/10.1006/jmaa.1993.1093 doi: 10.1006/jmaa.1993.1093
    [7] C. B. Wu, Y. L. Lin, Minimizing risk models in Markov decision processes with policies depending on target values, J. Math. Anal. Appl., 231 (1999), 47–67. https://doi.org/10.1006/jmaa.1998.6203 doi: 10.1006/jmaa.1998.6203
    [8] X. Wu, X. P. Guo, First passage optimality and variance minimization of Markov decision processes with varying discount factors, J. Appl. Probab., 52 (2015), 441–456. https://doi.org/10.1017/S0021900200012560 doi: 10.1017/S0021900200012560
    [9] Y. H. Huang, X. P. Guo, Optimal risk probability for first passage models in Semi-Markov processes, J. Math. Anal. Appl., 359 (2009), 404–420. https://doi.org/10.1016/j.jmaa.2009.05.058 doi: 10.1016/j.jmaa.2009.05.058
    [10] Y. H. Huang, X. P. Guo, Z. F. Li, Minimum risk probability for finite horizon semi-Markov decision processes, J. Math. Anal. Appl., 402 (2013), 378–391. https://doi.org/10.1016/j.jmaa.2013.01.021 doi: 10.1016/j.jmaa.2013.01.021
    [11] X. X. Huang, X. L. Zou, X. P. Guo, A minimization problem of the risk probability in first passage semi-Markov decision processes with loss rates, Sci. China Math., 58 (2015), 1923–1938. https://doi.org/10.1007/s11425-015-5029-x doi: 10.1007/s11425-015-5029-x
    [12] H. F. Huo, X. L. Zou, X. P. Guo, The risk probability criterion for discounted continuous-time Markov decision processes, Discrete Event Dyn. syst., 27 (2017), 675–699. https://doi.org/10.1007/s10626-017-0257-6 doi: 10.1007/s10626-017-0257-6
    [13] H. F. Huo, X. Wen, First passage risk probability optimality for continuous time Markov decision processes, Kybernetika, 55 (2019), 114–133. https://doi.org/10.14736/kyb-2019-1-0114 doi: 10.14736/kyb-2019-1-0114
    [14] H. F. Huo, X. P. Guo, Risk probability minimization problems for continuous time Markov decision processes on finite horizon, IEEE T. Automat. Contr., 65 (2020), 3199–3206. https://doi.org/10.1109/tac.2019.2947654 doi: 10.1109/tac.2019.2947654
    [15] X. Wen, H. F. Huo, X. P. Guo, First passage risk probability minimization for piecewise deterministic Markov decision processes, Acta Math. Appl. Sin. Engl. Ser., 38 (2022), 549–567. https://doi.org/10.1007/s10255-022-1098-0 doi: 10.1007/s10255-022-1098-0
    [16] A. Drake, Observation of a Markov process through a noisy channel, Massachusetts Institute of Technology, 1962.
    [17] K. Hinderer, Foundations of non-stationary dynamic programming with discrete time parameter, Berlin: Springer-Verlag, 1970.
    [18] D. Rhenius, Incomplete information in Markovian decision models, Ann. Statist., 26 (1974), 1327–1334. https://doi.org/10.1214/aos/1176342886 doi: 10.1214/aos/1176342886
    [19] O. Hernández-Lerma, Adaptive Markov control processes, New York: Springer-Verlag, 1989. https://doi.org/10.1007/978-1-4419-8714-3
    [20] R. D. Smallwood, E. J. Sondik, The optimal control of partially observable Markov processes over a finite horizon, Oper. Res., 21 (1973), 1071–1088. https://doi.org/10.1287/opre.21.5.1071 doi: 10.1287/opre.21.5.1071
    [21] K. Sawaki, A. Ichikawa, Optimal control for partially observable Markov decision processes over an infinite horizon, J. Oper. Res. Soc. JPN, 21 (1978), 1–16. https://doi.org/10.15807/jorsj.21.1 doi: 10.15807/jorsj.21.1
    [22] C. C. White, W. T. Scherer, Finite memory suboptimal design for partially observed Markov decision processes, Oper. Res., 42 (1994), 439–455. https://doi.org/10.1287/opre.42.3.439 doi: 10.1287/opre.42.3.439
    [23] E. A. Feinberg, P. O. Kasyanov, M. Z. Zgurovsky, Partially observable total cost Markov decision processes with weakly continuous transition probabilities, Math. Oper. Res., 41 (2016), 656–681. https://doi.org/10.1287/moor.2015.0746 doi: 10.1287/moor.2015.0746
    [24] M. Haklidir, H. Temeltas, Guided soft actor critic: A guided deep reinforcement learning approach for partially observable Markov decision processes, IEEE Access, 9 (2021), 159672–159683. https://doi.org/10.1109/access.2021.3131772 doi: 10.1109/access.2021.3131772
    [25] D. Bertsekas, S. Shreve, Stochastic optimal control: The discrete-time case, Athena Scientific, 1996.
  • © 2023 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)