1. Introduction
Many natural phenomena, whether physical, chemical, or biological, can be viewed as computing processes [1,2,3]. Inspired by such natural phenomena, many search algorithms for combinatorial optimization problems have been proposed to quickly obtain high-quality solutions, such as simulated annealing [4], neural networks [5], genetic algorithms [6], and DNA computing [7]. In this paper, we propose a new computing architecture that utilizes the desirable physical properties of “atomic switches” [8,9] to solve a decision-making problem. If this architecture could be implemented in physical devices in such a way that their computing processes are elegantly coupled with their underlying physics, we would be able to utilize their abilities to make accurate and speedy decisions in uncertain environments [10].
Decision-making is one of the most important intellectual abilities of human beings. In the context of reinforcement learning, the multi-armed bandit problem (MAB) was originally described by Robbins [11], although the essence of the problem had been studied earlier by Thompson [12]. Suppose there are M slot machines, each of which returns a reward (for example, coins) according to a certain probability density function (PDF) that is unknown to the player. Let us consider a minimal case: two machines, A and B, each give rewards according to an individual PDF with mean reward μA and μB, respectively. The player decides which machine to play at each trial, trying to maximize the total reward obtained after a number of trials. The goal of the MAB is to determine, as accurately and quickly as possible, the optimal strategy for playing the machines by referring to past experience.
The optimal strategy, called the “Gittins index,” is known only for a limited class of problems in which the reward distributions are assumed to be known to the players [13,14]. Even in this limited class, computing the Gittins index becomes intractable in many practical cases. In the algorithms proposed by Agrawal and by Auer et al., another index is expressed as a simple function of the reward sums obtained from the machines [15,16]. In particular, the “upper confidence bound 1 (UCB1) algorithm” for solving MABs is used worldwide in many practical applications [16]. The MAB is formulated as a mathematical problem without loss of generality and, as such, is related to various stochastic phenomena. In fact, many application problems in diverse fields, such as communications (cognitive networks [17,18]), commerce (advertising on the web [19]), and entertainment (Monte-Carlo tree search, which is used for computer games [20,21]), can be reduced to MABs.
A proposal was made about ten years ago for a conceptually novel switching device called the “atomic switch,” which is based on metal ion migration and electrochemical reactions in solid electrolytes (SEs) [8,9]. Because its resistance state is controlled continuously by the movement of a limited number of metal ions/atoms, the atomic switch can be regarded as a physics-based analog-computing element. Nanoarchitectonic designs using such atomic switches have recently been proposed for natural computing [22,23]. In this paper, using two atomic switches that interact with each other, we show that a physical constraint, the conservation law, allows for the efficient solving of decision-making problems.
This paper consists of five sections. In Sec. 2, a brief summary of an atomic-switch-based decision maker (ASDM) model is given, and its operating principle is described. The theoretical analyses of the dynamics underlying the ASDM are presented in Sec. 3. The simulation results based on the ASDM and the SOFTMAX algorithm, which is a well-known algorithm for solving the MABs [24], are compared in Sec. 4. Section 5 presents the conclusion with a short discussion.
2. Model
Recently, Kim et al. proposed a MAB solver [25,26] that reflects the behavior of a single-celled amoeboid organism (the true slime mold P. polycephalum), which maintains a constant intracellular resource volume while collecting environmental information by concurrently expanding and shrinking its pseudopod-like terminal parts. In this bio-inspired algorithm, the decision-making function is derived from its underlying “tug-of-war (TOW) game”-like dynamics. The physical constraint in TOW dynamics, the conservation law for the volume of the amoeboid body, entails a nonlocal correlation among the terminal parts. That is, a volume increment in one part is immediately compensated for by volume decrement(s) in another part(s). Owing to this nonlocal correlation, the TOW dynamics exhibit higher performance than other well-known algorithms, such as the modified ϵ-GREEDY algorithm and the modified SOFTMAX algorithm [25,26,10]. These observations suggest that efficient decision-making devices could be implemented using any physical object, as long as it obeys some physical conservation law. In fact, Kim et al. theoretically and experimentally demonstrated that optical energy-transfer dynamics between quantum dots, in which energy is conserved, can be used for the implementation of TOW dynamics [27,28,29].
Here, we propose a simplified model for an ASDM based on the TOW dynamics. The ASDM consists of two atomic switches (named A and B) located close to each other, as shown in Figure 1, in which an SE including metal ions is sandwiched between one Pt electrode on the top side and two Pt electrodes on the bottom side. The electrode material does not have to be Pt, but it should be an inert metal. Each atomic switch is operated in a metal/ionic conductor/metal (MIM) configuration, which can be referred to as a “gapless-type atomic switch” [30] because each MIM switch can form a metal filament between the top and bottom electrodes by precipitation on an inert electrode. In the initial state, we consider the situation where metal ions are distributed uniformly in the SE and the total number of metal ions is constant. First, a bias voltage of -V0 is applied to both switches A and B, i.e., to the bottom Pt electrodes relative to the top Pt electrode (VA = VB = -V0), and the current Ik passing through the respective switch is measured with a time step increment of ts. Under these circumstances, metal ions migrate to the respective bottom electrodes and are precipitated on them by the reduction reaction (Me+ + e- → M). Here, we assume that the same amounts of metal atoms are precipitated on both bottom electrodes, and the height of these initial precipitations is defined as X0, as shown in Figure 1. Because of the stochastic nature of precipitation phenomena, the current Ik should fluctuate with time even after reaching the equilibrium state.
In the next procedure, the ASDM compares the measured current Ik of both switches A and B with a threshold θ. If the current Ik becomes larger than θ at a certain time step, the ASDM chooses the corresponding slot machine(s) k (A and/or B), and sends this information to the problem side. On the problem side, a reward Rk(j) generated from each “unknown PDF” (the mean reward μk is also supposed to be unknown) is obtained as a result of stochastic events by playing the machine, where j is the step number. Depending on the reward, the added voltage ΔVk(j) is determined by
$$\Delta V_k(j) = R_k(j) - K, \tag{1}$$

and is returned to the ASDM. Here, Rk(j) is an arbitrary real value, and K is a parameter that will be described in detail later in this paper. As a result, the total voltage applied to the respective switch at the j-th step is changed to

$$V_k(j) = -\bigl(V_0 + \Delta V_k(j)\bigr). \tag{2}$$

In response to the change in the applied voltages, the heights of the precipitations in switches A and B may vary, and the displacement of each from the initial height X0 at time t is given by Xk(t) (k∈{A,B}); the total height thus becomes X0 + Xk(t). This procedure is repeated for the number of steps specified by the problem side, or until a metal filament in either switch is formed between the top and bottom electrodes.
During the operation, the following conditions are assumed:
1. The SE is nearly empty of metal ions to be precipitated. Owing to the conservation law of the total number of metal ions/atoms, precipitation of metal atoms (Me+ + e- → M) in one switch occurs together with the dissolution of metal atoms (M → Me+ + e-) in the other switch. This means that the height increment of one precipitation (XA or XB) is compensated by a decrement in the other. Eq.(3) represents this condition.
2. The time step ts consists of the time duration of applying the added voltage Δt and the interval time Δtint, as shown in Figure 2(b). In addition, Δt is sufficiently larger than Δtint to ignore the displacement of precipitations during Δtint. This means that once displacements of precipitations take place in Δt, this status is maintained in the subsequent Δtint after the application of ΔVk.
3. The difference in precipitation height of switch k from the (j - 1)-th step to the j-th step is proportional to ΔVk(j). For simplicity, the shape of the precipitated atoms is ignored.
4. Ik is also proportional to X0 + Xk(t). This assumption implies that there is also a threshold Th for the displacement of the precipitations, corresponding to the threshold θ of the current Ik: if Ik is larger than θ, the condition X0 + Xk(t) > Th is fulfilled. If Th is set to be smaller than X0, the ASDM dynamics can operate from the initial state even without fluctuations.
According to the foregoing assumptions, the displacement XA (= -XB) can be described by the following equations:
$$X_A(t_{j+1}) = Q_A(t_j) - Q_B(t_j) + \delta(t_j), \tag{3}$$

$$Q_k(t_j) = \sum_{i=1}^{N_k(t_j)} \Delta V_k(i) = \sum_{i=1}^{N_k(t_j)} R_k(i) - K\,N_k(t_j) \qquad (k \in \{A,B\}). \tag{4}$$

Here, Qk(tj) (k∈{A,B}) is an “index” of the information of past experiences accumulated from the initial time 1 to the current time tj, Nk counts the number of times that machine k has been played, ΔVk is the added voltage when playing machine k, δ(tj) is an arbitrary fluctuation to which the system is subjected, and K is a parameter. Eqs.(1) and (4) are called the “learning rule.” Consequently, the ASDM evolves according to a particularly simple rule: in addition to the fluctuation, if machine k is played at time tj, Rk − K is added to Qk to determine Xk(tj+1).
The basic behavior of the ASDM underlying the TOW dynamics is illustrated in Figure 2(a). Consider the case in which XA of switch A is higher than Th while XB of switch B is lower than Th at time step j. This situation corresponds to IA > θ and IB < θ. Under these circumstances, the ASDM chooses slot machine A, and a reward RA(j) is obtained by playing machine A, determined as the result of a stochastic event. Then, VA = -(V0 + ΔVA(j)) is applied to switch A, whereas switch B is kept at VB = -V0. In the case of ΔVA(j) > 0, the precipitation in switch A is enhanced by the reduction reaction Me+ + e- → M. On the other hand, the precipitation in switch B is reduced by the oxidation reaction M → Me+ + e-, owing to the conservation law of the total number of atoms and ions. As a result, the height difference between the precipitations of switches A and B increases, as shown in Figure 2(a). Note that ΔVk(j) can take either polarity because of the stochastic event; if ΔVA(j) < 0, VA and VB act to decrease the height difference between the two precipitations. Nevertheless, as the number of time steps increases, the ASDM eventually settles on one of the switches (A or B). The expected behavior of the ASDM is illustrated schematically in Figure 2(b). Here, Th = X0 is assumed. Even under this condition, the current Ik (correspondingly Xk) fluctuates around θ (Th) with time. At each time step j, the ASDM detects the switch showing a current higher than θ (a precipitation larger than Th) and selects the corresponding slot machine. The voltages applied to both switches are then updated according to the reward obtained from the slot machine. The ASDM does not always select exactly one machine; selection of both machines, or of neither, is also possible.
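To make the operating cycle concrete, the following minimal Python sketch simulates the ASDM update loop described above (Eqs.(1), (3), and (4)) under the assumptions of this section. The reward PDFs, threshold values, and number of steps are illustrative assumptions, and the proportionality constant between ΔVk and the height change is taken as unity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumed values, not taken from the paper)
mu = {"A": 0.6, "B": 0.5}        # unknown mean rewards of the two slot machines
sigma = 0.2                      # reward standard deviation
K0 = 0.5 * (mu["A"] + mu["B"])   # K0 = (mu_A + mu_B)/2
X0, Th = 1.0, 1.0                # initial precipitation height and threshold (Th = X0)
steps = 1000

Q = {"A": 0.0, "B": 0.0}         # learning-rule indices Q_k(t_j), Eq.(4)
N = {"A": 0, "B": 0}             # number of plays of each machine

for j in range(steps):
    delta = np.sin(np.pi / 2 + np.pi * j)          # fluctuation delta(t_j)
    x_a = Q["A"] - Q["B"] + delta                  # X_A(t_{j+1}), Eq.(3)
    X = {"A": x_a, "B": -x_a}                      # conservation law: X_B = -X_A
    for k in ("A", "B"):
        # I_k is assumed proportional to X0 + X_k, so I_k > theta
        # corresponds to X0 + X_k > Th (assumption 4).
        if X0 + X[k] > Th:
            R = rng.normal(mu[k], sigma)           # reward from the unknown PDF
            Q[k] += R - K0                         # Delta V_k(j) = R_k(j) - K, Eq.(1)
            N[k] += 1

print("plays:", N)   # with mu_A > mu_B, machine A should dominate as the height difference grows
```

With μA > μB, the height difference between the two precipitations tends to grow in favor of switch A, so the play counts should increasingly favor machine A.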
3. Theoretical Analyses
Theoretical analyses of the TOW dynamics for a Bernoulli type MAB, in which a reward is limited to 0 or 1, are described in [10]. In this section, theoretical analyses of the ASDM are described for a general MAB where a reward is not limited to 0 or 1 and can take an arbitrary value.
3.1. MAB Solvability
To explore the MAB solvability of the ASDM using the learning rule Qk (Eqs.(1) and (4)), let us consider a random-walk model as shown in Fig.3(a). Here, Rk(t) (k∈{A,B}) is a reward at time t, and K is a parameter (see Eq.(1)). We assume that the means of the probability density functions of Rk satisfy μA > μB for simplicity. After time step t, the displacement Dk(t) (k∈{A,B}) can be described by
$$D_k(t) = \sum_{j=1}^{N_k(t)} R_k(j) - K\,N_k(t). \tag{5}$$

The expected value of Dk can be obtained from the following equation:

$$E\bigl(D_k(t)\bigr) = (\mu_k - K)\,N_k(t). \tag{6}$$
In the overlapping area between the two distributions shown in Fig.3(b), we cannot accurately estimate which is larger. The overlapping area should decrease as Nk increases so as to avoid incorrect judgments. This requirement can be expressed by the following forms:
$$E\bigl(D_A(t)\bigr) = (\mu_A - K)\,N_A(t) \;\longrightarrow\; +\infty,$$

$$E\bigl(D_B(t)\bigr) = (\mu_B - K)\,N_B(t) \;\longrightarrow\; -\infty.$$

These expressions can be rearranged into the form

$$\mu_B < K < \mu_A.$$

In other words, the parameter K must satisfy the above condition so that the random walk correctly reflects which mean is larger. We can easily confirm that the following form, which we call K0, satisfies the above condition:

$$K_0 = \frac{\mu_A + \mu_B}{2}. \tag{10}$$
Therefore, we can conclude that the ASDM using the learning rule Qk with the parameter K0 can solve the MAB correctly.
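As a quick numerical check of this conclusion, the sketch below (with illustrative Gaussian reward PDFs, assumed for the example) estimates E(Dk(t)) by Monte Carlo and confirms that, with K = K0, the displacement of the better machine drifts toward +∞ while that of the worse machine drifts toward −∞, as the conditions above require.

```python
import numpy as np

rng = np.random.default_rng(1)
mu_A, mu_B, sigma, n = 0.6, 0.5, 0.2, 10000   # illustrative values
K0 = (mu_A + mu_B) / 2                        # Eq.(10)

# D_k = sum of n rewards minus K0 * n, Eq.(5), estimated over 200 sample paths
D_A = rng.normal(mu_A, sigma, size=(200, n)).sum(axis=1) - K0 * n
D_B = rng.normal(mu_B, sigma, size=(200, n)).sum(axis=1) - K0 * n

print(D_A.mean(), (mu_A - K0) * n)   # both about +500: E(D_A) grows positive
print(D_B.mean(), (mu_B - K0) * n)   # both about -500: E(D_B) grows negative
```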
3.2. Origin of the high performance
In many popular algorithms, such as the ϵ-GREEDY algorithm [24], at each time t an estimate of the mean reward is updated only for the machine that was actually played. On the other hand, in an imaginary circumstance in which the sum of the mean rewards γ=μA+μB is known to the player, we can update both estimates simultaneously, even though only one of the machines was played.
The top and bottom rows of Table 1 provide estimates based on the knowledge that machine A was played NA times and that machine B was played NB times, respectively. Note that we can also update the estimate of the machine that was not played, owing to the given γ.
Table 1. Estimates for each mean reward based on the knowledge that machine A was played NA times and that machine B was played NB times—on the assumption that the sum of the mean rewards γ=μA+μB is known.
|  | Estimate for machine A | Estimate for machine B |
| --- | --- | --- |
| From the NA plays of A | $\sum_{j=1}^{N_A} R_A(j)\,/\,N_A$ | $\gamma - \sum_{j=1}^{N_A} R_A(j)\,/\,N_A$ |
| From the NB plays of B | $\gamma - \sum_{j=1}^{N_B} R_B(j)\,/\,N_B$ | $\sum_{j=1}^{N_B} R_B(j)\,/\,N_B$ |
From the above estimates, each expected reward Q′k (k∈{A,B}) is given as follows:
$$Q'_A = N_A\,\frac{\sum_{j=1}^{N_A} R_A(j)}{N_A} + N_B\left(\gamma - \frac{\sum_{j=1}^{N_B} R_B(j)}{N_B}\right) = \sum_{j=1}^{N_A} R_A(j) - \sum_{j=1}^{N_B} R_B(j) + \gamma N_B, \tag{12}$$

$$Q'_B = N_A\left(\gamma - \frac{\sum_{j=1}^{N_A} R_A(j)}{N_A}\right) + N_B\,\frac{\sum_{j=1}^{N_B} R_B(j)}{N_B} = \sum_{j=1}^{N_B} R_B(j) - \sum_{j=1}^{N_A} R_A(j) + \gamma N_A. \tag{13}$$

These expected rewards Q′k are not the same as those given by the learning rule of the TOW dynamics, Qk in Eqs.(1) and (4). However, what is used substantially in the TOW dynamics is the difference

$$Q_A - Q_B = \left(\sum_{j=1}^{N_A} R_A(j) - \sum_{j=1}^{N_B} R_B(j)\right) - K\,(N_A - N_B). \tag{14}$$

When we transform the expected rewards Q′k into Q″k = Q′k/2, we obtain the difference

$$Q''_A - Q''_B = \left(\sum_{j=1}^{N_A} R_A(j) - \sum_{j=1}^{N_B} R_B(j)\right) - \frac{\gamma}{2}\,(N_A - N_B). \tag{15}$$

Comparing the coefficients of Eqs.(14) and (15), the two differences are always equal when K = K0 (Eq.(10)) is satisfied. Eventually, we can obtain the nearly optimal weighting parameter K0 in terms of γ.
This derivation implies that the learning rule for the ASDM is equivalent to that of the imaginary system in which both estimates can be updated simultaneously. In other words, the ASDM imitates the imaginary system that determines its next move at time t + 1 by referring to the estimates of both machines, even if one of them was not actually played at time t. This unique feature of the learning rule, derived from the fact that the sum of the mean rewards is given in advance, may be one of the origins of the high performance of the ASDM.
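The equivalence can be checked numerically. The following minimal sketch (with illustrative reward sequences and play counts, which are assumptions made for the example) confirms that the TOW difference QA − QB of Eq.(14) with K = K0 = γ/2 coincides with the halved imaginary-system difference Q″A − Q″B of Eq.(15).

```python
import numpy as np

rng = np.random.default_rng(2)
mu_A, mu_B, sigma = 0.6, 0.5, 0.2           # illustrative mean rewards
gamma = mu_A + mu_B
K0 = gamma / 2

N_A, N_B = 37, 21                            # arbitrary play counts
R_A = rng.normal(mu_A, sigma, N_A)           # rewards observed from machine A
R_B = rng.normal(mu_B, sigma, N_B)           # rewards observed from machine B

# TOW learning-rule difference, Eq.(14), with K = K0
tow_diff = (R_A.sum() - R_B.sum()) - K0 * (N_A - N_B)

# Imaginary system that also updates the unplayed machine using gamma,
# Eqs.(12)-(13), then halved as in Eq.(15)
Qp_A = R_A.sum() - R_B.sum() + gamma * N_B
Qp_B = R_B.sum() - R_A.sum() + gamma * N_A
imag_diff = (Qp_A - Qp_B) / 2

print(np.isclose(tow_diff, imag_diff))       # True: the two differences coincide
```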
Monte Carlo simulations were performed, and it was verified that the ASDM with K0 exhibits exceptionally high performance, comparable to its peak performance achieved with the optimal parameter Kopt. To derive the optimal value Kopt accurately, we need to take the fluctuations into account.
In addition, the essence of the process described here can be generalized to M-machine cases. To separate distributions of the top m-th and top (m + 1)-th machine, as shown in Fig.3(b), all we need is the following K0:
$$K_0 = \frac{\mu_{(m)} + \mu_{(m+1)}}{2}.$$

Here, μ(m) denotes the top m-th mean, and m is any integer from 1 to M − 1. The two-machine MAB considered above is the special case m = 1. In fact, for M-machine and X-player cases, we have designed a physical system that can determine the overall optimal state, called the “social maximum” [31,32], quickly and accurately [33,34].
3.3. Performance characteristics
In this section, we calculate a performance measure, called the “regret,” to characterize the high performance of the ASDM. We consider the “cheater algorithm,” an imaginary model for solving the MAB, because the regret of this algorithm can be calculated easily. The cheater algorithm selects a machine to play according to the following estimates Sk (k∈{A,B}):
$$S_A = X_{A,1} + X_{A,2} + \cdots + X_{A,N}, \tag{18}$$

$$S_B = X_{B,1} + X_{B,2} + \cdots + X_{B,N}. \tag{19}$$
Here, Xk,i is a random variable. If SA > SB at time t = N, machine A is played at time t = N + 1; if SB > SA at time t = N, machine B is played at time t = N + 1; and if SA = SB at time t = N, a machine is chosen randomly at time t = N + 1. Note that the algorithm refers to the results of both machines at time t without any attention to which machine was played at time t − 1. In other words, the algorithm “cheats”: it plays both machines and collects both results, but declares that it plays only one machine at a time.
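A minimal sketch of the cheater algorithm, with illustrative Gaussian reward PDFs (assumed for the example), makes the procedure explicit: both machines are sampled at every step, while only one play is declared.

```python
import numpy as np

rng = np.random.default_rng(3)
mu_A, mu_B, sigma, N_steps = 0.6, 0.5, 0.2, 1000   # illustrative values

S_A = S_B = 0.0
declared = []                       # machine the algorithm claims to play at each step
for t in range(N_steps):
    # Decide which machine to declare at time t+1 from the sums up to time t
    if S_A > S_B:
        declared.append("A")
    elif S_B > S_A:
        declared.append("B")
    else:
        declared.append(rng.choice(["A", "B"]))
    # "Cheat": sample a reward from BOTH machines regardless of the declaration
    S_A += rng.normal(mu_A, sigma)
    S_B += rng.normal(mu_B, sigma)

print("fraction of declared B plays:", declared.count("B") / N_steps)
```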
The expected value and the variance of Xk are defined as E(Xk) = μk and V(Xk) = σk². Here, μk is the mean reward of machine k defined earlier. From the central-limit theorem, Sk has a Gaussian distribution with E(Sk) = μk N and V(Sk) = σk² N. If we define a new variable S = SA − SB, then S also has a Gaussian distribution with the following values:

$$E(S) = (\mu_A - \mu_B)\,N, \qquad V(S) = (\sigma_A^2 + \sigma_B^2)\,N.$$

From Fig.4, the probability of playing machine B, which has the lower mean reward, can be described as Q(E(S)/σ(S)), where Q(x) is the Q-function. We obtain

$$\frac{E(S)}{\sigma(S)} = \phi\,\sqrt{N}.$$

Here,

$$\phi = \frac{\mu_A - \mu_B}{\sqrt{\sigma_A^2 + \sigma_B^2}}.$$
Using the Chernoff bound $Q(x) \le \tfrac{1}{2}\exp(-x^2/2)$, we can calculate the upper bound of a measure, called the “regret,” which quantifies the accumulated losses of the algorithm:
$$\mathrm{regret} = (\mu_A - \mu_B)\,E(N_B), \tag{25}$$

$$E(N_B) = \sum_{t=0}^{N-1} Q\!\left(\phi\sqrt{t}\right) \le \sum_{t=0}^{N-1} \frac{1}{2}\exp\!\left(-\frac{\phi^2}{2}t\right) = \frac{1}{2} + \sum_{t=1}^{N-1} \frac{1}{2}\exp\!\left(-\frac{\phi^2}{2}t\right) \le \frac{1}{2} + \int_{0}^{N-1} \frac{1}{2}\exp\!\left(-\frac{\phi^2}{2}t\right)dt = \frac{1}{2} - \frac{1}{\phi^2}\left(\exp\!\left(-\frac{\phi^2}{2}(N-1)\right) - 1\right). \tag{26}$$

Note that the regret becomes constant as N increases.
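As a numerical sanity check of Eqs.(25) and (26), the following sketch (with illustrative parameter values) compares the exact sum for E(NB) with the Chernoff-based closed form; the bound is loose but finite, which is why the regret saturates.

```python
import math

mu_A, mu_B, sigma_A, sigma_B = 0.6, 0.5, 0.2, 0.2   # illustrative values
N = 1000

phi = (mu_A - mu_B) / math.sqrt(sigma_A**2 + sigma_B**2)

def Q(x):
    """Gaussian Q-function, Q(x) = P(Z > x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2))

# Exact sum for E(N_B) and its Chernoff-based upper bound, Eq.(26)
e_nb = sum(Q(phi * math.sqrt(t)) for t in range(N))
bound = 0.5 - (1 / phi**2) * (math.exp(-phi**2 / 2 * (N - 1)) - 1)

print("E(N_B) sum  :", e_nb)
print("upper bound :", bound)
print("regret bound:", (mu_A - mu_B) * bound)       # Eq.(25)
```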
Using the “cheated” results, we can also calculate the regret for the ASDM in the same way. In this case,
$$S_A = X_{A,1} + X_{A,2} + \cdots + X_{A,N_A} - K\,N_A, \tag{28}$$

$$S_B = X_{B,1} + X_{B,2} + \cdots + X_{B,N_B} - K\,N_B. \tag{29}$$
Here, Xk,i is also a random variable. Then, we obtain

$$E(S_k) = (\mu_k - K)\,N_k, \qquad V(S_k) = \sigma_k^2\,N_k.$$

Using the new variables S = SA − SB, N = NA + NB, and D = NA − NB, we also obtain

$$E(S) = \frac{\mu_A - \mu_B}{2}\,N + \left(\frac{\mu_A + \mu_B}{2} - K\right) D, \tag{32}$$

$$V(S) = \frac{\sigma_A^2 + \sigma_B^2}{2}\,N + \frac{\sigma_A^2 - \sigma_B^2}{2}\,D. \tag{33}$$
If the conditions K = K0 and σA = σB ≡ σ are satisfied, we then obtain

$$E(S) = \frac{\mu_A - \mu_B}{2}\,N$$

and

$$V(S) = \sigma^2 N.$$

Here,

$$\phi_T = \frac{E(S)}{\sigma(S)\sqrt{N}} = \frac{\mu_A - \mu_B}{2\sigma}.$$

We can then calculate the upper bound of the regret for the ASDM:

$$E(N_B) = \sum_{t=0}^{N-1} Q\!\left(\phi_T\sqrt{t}\right) \le \frac{1}{2} - \frac{1}{\phi_T^2}\left(\exp\!\left(-\frac{\phi_T^2}{2}(N-1)\right) - 1\right). \tag{38}$$

Note that the regret for the ASDM also becomes constant as N increases.
It is well known that optimal algorithms for the MAB, as defined by Auer et al. [16], have a regret proportional to log(N). The regret for the optimal algorithms has no finite upper bound as N increases because they must keep playing the lower-reward machine to ensure that the probability of an incorrect judgment goes to zero. The constant regret of the ASDM means that its probability of an incorrect judgment remains non-zero, although this probability is nearly equal to zero. However, in actual decision-making situations the reward probabilities would appear to change frequently, and long-term behavior is unimportant for many practical purposes. For this reason, the ASDM would be better suited to real-world applications.
4. Simulation Results
In this section, to show the effectiveness of the ASDM, we compare the performance of the ASDM and the SOFTMAX algorithm, a well-known algorithm for efficient decision-making [24] (see Appendix). From computer simulations, we confirmed that, in almost all cases, an ASDM with the parameter K0 (= (μA + μB)/2) acquires more rewards than a SOFTMAX algorithm with the optimized parameter τopt, although SOFTMAX is well known as a good algorithm [35]. Here, the parameter K0 is nearly optimal, as shown in Fig.5(a). Regret, as defined in the previous section, is a performance measure in which a lower value indicates higher performance (more rewards). Figure 5(b) shows the ASDM/SOFTMAX performance comparison. The vertical axis denotes the regret (mean values over 1000 samples), and the horizontal axis denotes the number of plays. The blue dotted line denotes the upper bound of the regret for the ASDM with K0 (Eq.(39)). For the reward PDFs, we used normal distributions N(μA, σ²) and N(μB, σ²), where μA = 0.6, μB = 0.5, and σ = 0.2. The computer simulations were executed under the conditions Th = X0 and δ = sin(π/2 + πt).
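A sketch of the ASDM part of this simulation is shown below. It follows the stated setup (normal reward PDFs with μA = 0.6, μB = 0.5, σ = 0.2, Th = X0, and δ = sin(π/2 + πt)) and reports the empirical regret, which can be compared with the analytic bound of Eq.(38). The SOFTMAX baseline and the 1000-sample averaging of Figure 5(b) are not reproduced here; the averaging over 100 runs below is an illustrative choice.

```python
import numpy as np

def asdm_regret(n_plays, mu_A=0.6, mu_B=0.5, sigma=0.2, seed=0):
    """Simulate the ASDM (Eqs.(1), (3), (4)) and return the regret (mu_A - mu_B) * N_B."""
    rng = np.random.default_rng(seed)
    mu = {"A": mu_A, "B": mu_B}
    K0 = (mu_A + mu_B) / 2
    Q = {"A": 0.0, "B": 0.0}
    N_B = 0
    for j in range(n_plays):
        delta = np.sin(np.pi / 2 + np.pi * j)        # delta(t) = sin(pi/2 + pi*t)
        x_a = Q["A"] - Q["B"] + delta                # Eq.(3); X_B = -X_A
        for k, X in (("A", x_a), ("B", -x_a)):
            if X > 0:                                # Th = X0, i.e. I_k > theta
                R = rng.normal(mu[k], sigma)
                Q[k] += R - K0                       # learning rule, Eqs.(1) and (4)
                if k == "B":
                    N_B += 1
    return (mu_A - mu_B) * N_B                       # regret, Eq.(25)

# Mean empirical regret over 100 runs (the paper averages 1000 samples)
print(np.mean([asdm_regret(1000, seed=s) for s in range(100)]))
```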
5. Discussion and Conclusion
In this study, we proposed an ASDM for solving the MAB and analytically validated its high efficiency in making decisions. In conventional decision-making algorithms for solving MABs, the parameter for adjusting the “exploration time” must be optimized. This exploration parameter always reflects the difference between the rewarded experiences, i.e., |μA − μB|. In contrast, the ASDM demonstrates that higher performance can be achieved by introducing a parameter K0 that refers to the sum of the rewarded experiences, i.e., μA + μB. This type of optimization, using the sum of the rewarded experiences, is particularly useful for time-varying environments (reward probability or reward PDF) [26]. Owing to this novelty, the high performance of the TOW dynamics can be reproduced when implementing these dynamics with atomic switches.
The ASDM proposed in this paper is a simple “ideal model.” While the assumptions used to construct the model may contain some points that do not match real experimental situations, the model can be extended so that modified assumptions match real experimental situations more accurately. As the model is extended to treat more than two options, it may turn out that there are experimental limitations to implementing TOW dynamics with more than two atomic switches. As long as the TOW dynamics between atomic switches is implemented, however, high-performance decision-making can be guaranteed even in the extended model.
The ASDM will introduce a new physics-based analog-computing paradigm, which will include such things as “intelligent nanodevices” based on self-judgment. Thus, our proposed physics-based analog-computing paradigm would be useful for a variety of real-world applications and for understanding the biological information-processing principles that exploit their underlying physics.
Appendix
SOFTMAX algorithm
The SOFTMAX algorithm is a well-known algorithm for solving MABs [24]. In this algorithm, the probability of selecting A or B, P′A(t) or P′B(t), is given by the following Boltzmann distributions:
$$P'_A(t) = \frac{\exp[\beta \cdot Q_A(t)]}{\exp[\beta \cdot Q_A(t)] + \exp[\beta \cdot Q_B(t)]}, \tag{40}$$

$$P'_B(t) = \frac{\exp[\beta \cdot Q_B(t)]}{\exp[\beta \cdot Q_A(t)] + \exp[\beta \cdot Q_B(t)]}, \tag{41}$$
where Qk(t) (k∈{A,B}) is given by

$$Q_k(t) = \frac{\sum_{j=1}^{N_k(t)} R_k(j)}{N_k(t)}.$$

Here, β is a time-dependent function in our study, controlled by the parameter τ (see Sec. 4): β = 0 corresponds to a random selection, and larger β corresponds to a greedier action.
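For reference, a minimal sketch of one SOFTMAX selection step under Eqs.(40) and (41) is given below; the linear schedule β(t) = τ·t is an assumption made for illustration, since the exact time dependence of β is not reproduced here.

```python
import numpy as np

def softmax_choice(rewards_A, rewards_B, t, tau=1.0, rng=None):
    """One SOFTMAX selection step: Boltzmann probabilities from the mean-reward
    estimates Q_k(t), Eqs.(40)-(41). beta(t) = tau * t is an assumed schedule."""
    rng = rng or np.random.default_rng()
    Q_A = np.mean(rewards_A) if rewards_A else 0.0   # Q_k(t) = sum_j R_k(j) / N_k(t)
    Q_B = np.mean(rewards_B) if rewards_B else 0.0
    beta = tau * t
    w = np.exp(beta * np.array([Q_A, Q_B]))
    p = w / w.sum()                                  # P'_A(t), P'_B(t)
    return rng.choice(["A", "B"], p=p)

# Example: after a few observed rewards, the better-looking machine is favoured
print(softmax_choice([0.7, 0.6], [0.4, 0.5], t=10, rng=np.random.default_rng(4)))
```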
Acknowledgments
The authors thank Dr. Etsushi Nameda and Prof. Masahiko Hara for valuable discussions in the early stages of this work. This work was supported in part by the Sekisui Chemical Grant Program for “Research on Manufacturing Based on Innovations Inspired by Nature.”
Conflict of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.