
Decision maker based on atomic switches

  • We propose a simple model for an atomic switch-based decision maker (ASDM), and show that, as long as its total number of metal atoms is conserved when coupled with suitable operations, an atomic switch system provides a sophisticated “decision-making” capability, which is known to be one of the most important intellectual abilities of human beings. We considered a popular decision-making problem studied in the context of reinforcement learning, the multi-armed bandit problem (MAB): the problem of finding, as accurately and quickly as possible, the most profitable option from a set of options, each of which gives stochastic rewards. The decisions are dictated by the volumes of precipitated metal atoms, which move in a manner similar to the fluctuations of a rigid body in a tug-of-war game. The “tug-of-war (TOW) dynamics” of the ASDM exhibits higher efficiency than conventional reinforcement-learning algorithms. We present analytical calculations that validate the statistical reasons for the ASDM to produce such high performance, despite its simplicity. Efficient MAB solvers are useful for many practical applications, because the MAB abstracts a variety of decision-making problems in real-world situations where efficient trial and error is required. The proposed scheme will open up a new direction in physics-based analog-computing paradigms, which will include such things as “intelligent nanodevices” based on self-judgment.

    Citation: Song-Ju Kim, Tohru Tsuruoka, Tsuyoshi Hasegawa, Masashi Aono, Kazuya Terabe, Masakazu Aono. Decision maker based on atomic switches[J]. AIMS Materials Science, 2016, 3(1): 245-259. doi: 10.3934/matersci.2016.1.245



    1. Introduction

    Many natural phenomena, including physical, chemical, and biological ones, can be viewed as computing processes [1,2,3]. Inspired by such natural phenomena, many search algorithms for combinatorial optimization problems have been proposed to quickly obtain high-quality solutions, such as simulated annealing [4], neural networks [5], genetic algorithms [6], and DNA computing [7]. In this paper, we propose a new computing architecture that utilizes the desirable physical properties of “atomic switches” [8,9] to solve a decision-making problem. If this architecture could be implemented in physical devices in such a way that their computing processes are elegantly coupled with their underlying physics, we would be able to utilize their abilities to make accurate and speedy decisions in uncertain environments [10].

    Decision-making is one of the most important intellectual abilities of human beings. In the context of reinforcement learning, the multi-armed bandit problem (MAB) was originally described by Robbins [11], although the essence of the problem had been studied earlier by Thompson [12]. Suppose there are M slot machines, each of which returns a reward (for example, coins) with a certain probability density function (PDF) that is unknown to a player. Let us consider a minimal case: two machines, A and B, give rewards according to individual PDFs whose mean rewards are μA and μB, respectively. The player decides which machine to play at each trial, trying to maximize the total reward obtained after repeating several trials. The goal of the MAB is to determine the optimal strategy for playing the machines, as accurately and quickly as possible, by referring to past experience.

    The optimal strategy, called the “Gittins index,” is known only for a limited class of problems in which the reward distributions are assumed to be known to the players [13,14]. Even in this limited class, in practice, computing the Gittins index becomes intractable in many cases. For the algorithms proposed by Agrawal and Auer et al., another index was expressed as a simple function of the reward sums obtained from the machines [15,16]. In particular, the “upper confidence bound 1 (UCB1) algorithm” for solving MABs is used worldwide in many practical applications [16]. The MAB is formulated as a mathematical problem without loss of generality and, as such, is related to various stochastic phenomena. In fact, many application problems in diverse fields, such as communications (cognitive networks [17,18]), commerce (advertising on the web [19]), and entertainment (Monte-Carlo tree search, which is used for computer games [20,21]), can be reduced to MABs.
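    As a point of reference for the index-based policies mentioned above, the sketch below shows a minimal UCB1 loop for a two-machine bandit with Gaussian rewards. It is an illustrative sketch only: the environment, the reward parameters, and the function names are assumptions made for this example, not details taken from the paper or from [16].

        import math
        import random

        def ucb1(pull, n_arms, n_plays):
            """Minimal UCB1 loop: after playing each arm once, always play the arm
            with the largest empirical mean plus confidence term sqrt(2 ln t / n_i)."""
            counts = [0] * n_arms
            sums = [0.0] * n_arms
            for t in range(1, n_plays + 1):
                if t <= n_arms:                     # initialization: play each arm once
                    arm = t - 1
                else:
                    arm = max(range(n_arms),
                              key=lambda i: sums[i] / counts[i]
                              + math.sqrt(2.0 * math.log(t) / counts[i]))
                counts[arm] += 1
                sums[arm] += pull(arm)
            return counts

        # Illustrative two-machine environment (means 0.6 and 0.5, standard deviation 0.2).
        means = [0.6, 0.5]
        print(ucb1(lambda arm: random.gauss(means[arm], 0.2), n_arms=2, n_plays=1000))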

    A proposal was made about ten years ago for a conceptually novel switching device called the “atomic switch,” which is based on metal ion migration and electrochemical reactions in solid electrolytes (SEs) [8,9]. Because its resistance state is controlled continuously by the movement of a limited number of metal ions/atoms, the atomic switch can be regarded as a physics-based analog-computing element. Nanoarchitectonic designs using such atomic switches have recently been proposed for natural computing [22,23]. In this paper, using two atomic switches that interact with each other, we show that a physical constraint, the conservation law, allows for the efficient solving of decision-making problems.

    This paper consists of five sections. In Sec. 2, a brief summary of an atomic-switch-based decision maker (ASDM) model is given, and its operating principle is described. The theoretical analyses of the dynamics underlying the ASDM are presented in Sec. 3. The simulation results based on the ASDM and the SOFTMAX algorithm, which is a well-known algorithm for solving the MABs [24], are compared in Sec. 4. Section 5 presents the conclusion with a short discussion.

    2. Model

    Recently, Kim et al. proposed a MAB solver [25,26] that reflects the behavior of a single-celled amoeboid organism (the true slime mold P. polycephalum), which maintains a constant intracellular resource volume while collecting environmental information by concurrently expanding and shrinking its pseudopod-like terminal parts. In this bio-inspired algorithm, the decision-making function is derived from its underlying “tug-of-war (TOW) game”-like dynamics. The physical constraint in TOW dynamics, the conservation law for the volume of the amoeboid body, entails a nonlocal correlation among the terminal parts. That is, a volume increment in one part is immediately compensated for by volume decrement(s) in another part(s). Owing to the nonlocal correlation, the TOW dynamics exhibits higher performance than other well-known algorithms, such as the modified ϵ-GREEDY algorithm and the modified SOFTMAX algorithm [25,26,10]. These observations suggest that efficient decision-making devices could be implemented using any physical object, as long as it holds some physical conservation law. In fact, Kim et al. theoretically and experimentally demonstrated that optical energy-transfer dynamics between quantum dots, in which energy is conserved, can be used for the implementation of TOW dynamics [27,28,29].

    Here, we propose a simplified model for an ASDM based on the TOW dynamics. The ASDM consists of two atomic switches (named A and B) located close to each other, as shown in Figure 1, in which an SE containing metal ions is sandwiched between one Pt electrode on the top side and two Pt electrodes on the bottom side. The electrode material does not have to be Pt, but it should be an inert metal. Each atomic switch is operated in a metal/ionic conductor/metal (MIM) configuration, which can be referred to as a “gapless-type atomic switch” [30], because each MIM switch can form a metal filament between the top and bottom electrodes by precipitation on an inert electrode. In the initial state, we consider the situation where metal ions are distributed uniformly in the SE and the total number of metal ions is constant. First, a bias voltage of -V0 is applied to both switches A and B, i.e., to the bottom Pt electrodes relative to the top Pt electrode (VA = VB = -V0), and the current Ik passing through the respective switch is measured with a time step increment of ts. Under these circumstances, metal ions migrate to the respective bottom electrodes and are precipitated on them by the reduction reaction (Me+ + e- → M). Here, we assume that the same amounts of metal atoms are precipitated on both bottom electrodes, and the heights of these initial precipitations are defined by X0, as shown in Figure 1. Because of the stochastic nature of precipitation phenomena, the current Ik should fluctuate with time even after reaching the equilibrium state.

    Figure 1. The ASDM using gapless-type atomic switches. The ASDM decides which machine (A and/or B) is to be played at time t according to whether the current Ik is larger than θ or not.

    In the next procedure, the ASDM compares the measured current Ik of both switches A and B with a threshold θ. If the current Ik becomes larger than θ at a certain time step, the ASDM chooses the corresponding slot machine(s) k (A and/or B), and sends this information to the problem side. On the problem side, a reward Rk(j) generated from each “unknown PDF” (the mean reward μk is also supposed to be unknown) is obtained as a result of stochastic events by playing the machine, where j is the step number. Depending on the reward, the added voltage ΔVk(j) is determined by

    $\Delta V_k(j) = R_k(j) - K$,  (1)
    and is returned to the ASDM. Here, Rk(j) is an arbitrary real value, and K is a parameter that will be described in detail later on in this paper. As a result, the total voltage applied to the respective switch at the j-th step is changed to
    $V_k = -(V_0 + \Delta V_k(j))$.  (2)
    In response to the change in the applied voltages, the height of precipitations in switches A and B may vary, and each displacement from the initial height X0 at time t is given by Xk(t) (k ∈ {A, B}). The total height becomes X0 + Xk(t). This procedure is repeated for the number of steps specified by the problem side or until a metal filament in either switch is formed between the top and bottom electrodes.

    During the operation, the following conditions are assumed:

    1. The SE contains almost no free metal ions available for further precipitation. Owing to the conservation law of the total number of metal ions/atoms, precipitation of metal atoms (Me+ + e- → M) in one switch occurs together with the dissolution of metal atoms (M → Me+ + e-) in the other switch. This means that a height increment of one precipitation (XA or XB) is compensated by a decrement in the other. Eq. (3) represents this condition.

    2. The time step ts consists of the time duration of applying the added voltage Δt and the interval time Δtint, as shown in Figure 2(b). In addition, Δt is sufficiently larger than Δtint to ignore the displacement of precipitations during Δtint. This means that once displacements of precipitations take place in Δt, this status is maintained in the subsequent Δtint after the application of ΔVk.

    Figure 2. (a) Tug-of-war (TOW) dynamics in the ASDM. (b) An example of expected behavior of the ASDM, showing the relationship between voltage Vk and displacement Xk. Here, added voltage ΔVk(j) is determined by each reward Rk(j) (Eq.(1)) at play j (Ik > θ). The ASDM selected machine A, A, A, B, … in this figure.

    3. The difference in precipitation height of switch k from the (j - 1)-th step to the j-th step is proportional to ΔVk(j). For simplicity, the shape of the precipitated atoms is ignored.

    4. Ik is also proportional to X0 + Xk(t). This assumption implies that there is also a threshold Th for the displacement of the precipitations, which corresponds to the threshold θ of the current Ik. If Ik is larger than θ, the condition X0 + Xk(t) > Th is fulfilled. If Th is set to be smaller than X0, the ASDM dynamics works from the initial state without fluctuations.

    According to the foregoing assumptions, the displacement XA (= -XB) can be described by the following equations:

    $X_A(t_{j+1}) = Q_A(t_j) - Q_B(t_j) + \delta(t_j)$,  (3)
    $Q_k(t_j) = \sum_{j=1}^{N_k} \Delta V_k(j)$.  (4)
    Here, Qk(tj) (k ∈ {A, B}) is an “index” of the information of past experiences accumulated from the initial time 1 to the current time tj, Nk counts the number of times that machine k has been played, ΔVk is the added voltage applied when playing machine k, and δ(tj) is an arbitrary fluctuation to which the body is subjected. Eqs. (1) and (4) are called the “learning rule.” Consequently, the ASDM evolves according to a particularly simple rule: in addition to the fluctuation, if machine k is played at time tj, Rk(j) - K is added to Xk(tj).

    The basic behavior of the ASDM underlying TOW dynamics is illustrated in Figure 2(a). Consider that the XA of switch A is higher than Th while the XB of switch B is lower than Th at time step j. This situation corresponds to IA > θ and IB < θ. Under these circumstances, the ASDM chooses slot machine A, and a reward RA(j) is obtained by playing machine A, which is determined as a result of a stochastic event. Then, VA = -(V0 + ΔVA(j)) is applied to switch A, whereas switch B is kept at VB = -V0. In the case of ΔVA(j) > 0, the precipitation of switch A is enhanced by the reduction reaction Me+ + e- → M. On the other hand, the precipitation of switch B is reduced by the oxidation reaction M → Me+ + e-, owing to the conservation law of the total number of atoms and ions. As a result, the height difference between the precipitations of switches A and B increases, as shown in Figure 2(a). Note that ΔVk(j) can take both polarities because of the stochastic event. If ΔVA(j) < 0, VA and VB are applied so as to decrease the height difference between the two precipitations. However, as the number of time steps increases, the ASDM finally decides to select one of the switches (A or B). We illustrate schematically the expected behavior of the ASDM in Figure 2(b). Here, Th = X0 is assumed. Even under this condition, the current Ik (correspondingly Xk) fluctuates around θ (Th) with time. At each time step j, the ASDM detects the switch showing a higher current than θ (a larger precipitation than Th) and selects the corresponding slot machine. The voltages applied to both switches are then updated according to the reward obtained from the slot machines. The ASDM does not always select one machine; selection of both machines or neither of them is also possible.
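    To make the operating cycle concrete, the following minimal simulation sketch implements the learning rule of Eqs. (1)-(4) for two machines with Gaussian reward PDFs. The device physics (ion migration, precipitation, filament growth) is abstracted into the single state variable XA = -XB, the sketch plays exactly one machine per step (whereas the ASDM may also select both or neither), and the function name, the reward parameters, and the use of K = K0 are illustrative assumptions; the threshold Th = X0 and the fluctuation δ(t) = sin(π/2 + πt) follow the conditions used in Sec. 4.

        import math
        import random

        def asdm_tow(mu_a=0.6, mu_b=0.5, sigma=0.2, n_plays=1000, K=None):
            """Minimal sketch of the ASDM learning rule (Eqs. (1)-(4)).
            x = X_A = -X_B stands for the height difference of the two precipitations;
            the machine whose displacement exceeds the threshold (here 0) is played."""
            if K is None:
                K = (mu_a + mu_b) / 2.0      # K_0 of Eq. (10), nearly optimal
            q_a = q_b = 0.0                  # accumulated "indices" Q_A, Q_B (Eq. (4))
            regret = 0.0
            for t in range(n_plays):
                delta = math.sin(math.pi / 2 + math.pi * t)    # fluctuation, as in Sec. 4
                x = q_a - q_b + delta                          # Eq. (3), with Th = X_0
                if x >= 0:                                     # I_A > theta: play machine A
                    q_a += random.gauss(mu_a, sigma) - K       # Eq. (1): added voltage R - K
                else:                                          # I_B > theta: play machine B
                    q_b += random.gauss(mu_b, sigma) - K
                    regret += mu_a - mu_b                      # loss relative to the better machine
            return regret

        print(asdm_tow())    # regret after 1000 plays, typically a small constant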

    3. Theoretical Analyses

    Theoretical analyses of the TOW dynamics for a Bernoulli type MAB, in which a reward is limited to 0 or 1, are described in [10]. In this section, theoretical analyses of the ASDM are described for a general MAB where a reward is not limited to 0 or 1 and can take an arbitrary value.

    3.1. MAB Solvability

    To explore the MAB solvability of the ASDM using the learning rule Qk (Eqs. (1) and (4)), let us consider a random-walk model as shown in Fig. 3(a). Here, Rk(t) (k ∈ {A, B}) is a reward at time t, and K is a parameter (see Eq. (1)). We assume, for simplicity, that the means of the probability density functions of Rk satisfy μA > μB. After time step t, the displacement Dk(t) (k ∈ {A, B}) can be described by

    $D_k(t) = \sum_{j=1}^{N_k(t)} R_k(j) - K\,N_k(t)$.  (5)
    The expected value of Dk can be obtained from the following equation:
    $E(D_k(t)) = (\mu_k - K)\,N_k(t)$.  (6)

    Figure 3. (a) Random walk: flight Rk(t) - K. Here, the probability density function of Rk has the mean μk. (b) Probability distributions of two random walks.

    In the overlapping area between the two distributions shown in Fig.3(b), we cannot accurately estimate which is larger. The overlapping area should decrease as Nk increases so as to avoid incorrect judgments. This requirement can be expressed by the following forms:

    $\mu_A - K > 0$,  (7)
    $\mu_B - K < 0$.  (8)
    These expressions can be rearranged into the form
    $\mu_B < K < \mu_A$.  (9)
    In other words, the parameter K must satisfy the above conditions so that the random walks correctly indicate which mean reward is the larger.

    We can easily confirm that the following form, which we call K0, satisfies the above conditions:

    $K_0 = \frac{\gamma}{2}$,  (10)
    $\gamma = \mu_A + \mu_B$.  (11)
    Therefore, we can conclude that the ASDM using the learning rule Qk with the parameter K0 can solve the MAB correctly.
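    As a concrete illustration, using the reward means assumed later in Sec. 4 (μA = 0.6 and μB = 0.5, which are not specific to this derivation), condition (9) and the choice (10) give

    $K_0 = \frac{0.6 + 0.5}{2} = 0.55, \qquad E(D_A(t)) = (0.6 - 0.55)\,N_A(t) > 0, \qquad E(D_B(t)) = (0.5 - 0.55)\,N_B(t) < 0,$

    so the random walk associated with the better machine drifts upward while that of the poorer machine drifts downward, and the overlap of the two distributions in Fig. 3(b) shrinks as the number of plays grows.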

    3.2. Origin of the high performance

    In many popular algorithms such as the ϵ-GREEDY algorithm [24], at each time t, an estimate of reward probability is updated for either of the two machines being played. On the other hand, in an imaginary circumstance in which the sum of the mean rewards γ=μA+μB is known to the player, we can update both of the two estimates simultaneously, even though only one of the machines was played.

    The top and bottom rows of Table 1 provide estimates based on the knowledge that machine A was played NA times and that machine B was played NB times, respectively. Note that we can also update the estimate of the machine that was not played, owing to the given γ.

    Table 1. Estimates for each mean reward based on the knowledge that machine A was played NA times and that machine B was played NB times—on the assumption that the sum of the mean rewards γ=μA+μB is known.
    A was played NA times:   estimate for A: $\frac{1}{N_A}\sum_{j=1}^{N_A} R_A(j)$;   estimate for B: $\gamma - \frac{1}{N_A}\sum_{j=1}^{N_A} R_A(j)$
    B was played NB times:   estimate for A: $\gamma - \frac{1}{N_B}\sum_{j=1}^{N_B} R_B(j)$;   estimate for B: $\frac{1}{N_B}\sum_{j=1}^{N_B} R_B(j)$

    From the above estimates, each expected reward Q′k (k ∈ {A, B}) is given as follows:

    $Q'_A = N_A \cdot \frac{1}{N_A}\sum_{j=1}^{N_A} R_A(j) + N_B\left(\gamma - \frac{1}{N_B}\sum_{j=1}^{N_B} R_B(j)\right) = \sum_{j=1}^{N_A} R_A(j) - \sum_{j=1}^{N_B} R_B(j) + \gamma N_B$,  (12)
    $Q'_B = N_A\left(\gamma - \frac{1}{N_A}\sum_{j=1}^{N_A} R_A(j)\right) + N_B \cdot \frac{1}{N_B}\sum_{j=1}^{N_B} R_B(j) = \sum_{j=1}^{N_B} R_B(j) - \sum_{j=1}^{N_A} R_A(j) + \gamma N_A$.  (13)
    These expected rewards Q′k are not the same as the Qk given by the learning rule of the TOW dynamics in Eqs. (1) and (4). However, what the TOW dynamics substantially uses is the difference
    $Q_A - Q_B = \left(\sum_{j=1}^{N_A} R_A(j) - \sum_{j=1}^{N_B} R_B(j)\right) - K\,(N_A - N_B)$.  (14)
    When we transform the expected rewards Q′k into Q″k = Q′k/2, we obtain the difference
    $Q''_A - Q''_B = \left(\sum_{j=1}^{N_A} R_A(j) - \sum_{j=1}^{N_B} R_B(j)\right) - \frac{\gamma}{2}\,(N_A - N_B)$.  (15)
    Comparing the coefficients of Eqs. (14) and (15), the two differences are always equal when K = K0 (Eq. (10)) is satisfied. Eventually, we can obtain the nearly optimal weighting parameter K0 in terms of γ.

    This derivation implies that the learning rule for the ASDM is equivalent to that of the imaginary system in which both of the two estimates can be updated simultaneously. In other words, the ASDM imitates the imaginary system that determines its next move at time t + 1 by referring to the estimates of the two machines, even if one of them was not actually played at time t. This unique feature of the learning rule, derived from the fact that the sum of the mean rewards is given in advance, may be one of the origins of the high performance of the ASDM.
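    The equivalence can also be checked numerically: for any play history, the TOW difference of Eq. (14) with K = K0 coincides with the rescaled imaginary-system difference of Eq. (15). The reward draws and the particular parameter values in the snippet below are illustrative assumptions.

        import random

        mu_a, mu_b, sigma = 0.6, 0.5, 0.2
        gamma = mu_a + mu_b
        K0 = gamma / 2.0

        # An arbitrary play history: rewards actually observed on each machine.
        R_a = [random.gauss(mu_a, sigma) for _ in range(37)]
        R_b = [random.gauss(mu_b, sigma) for _ in range(21)]
        N_a, N_b = len(R_a), len(R_b)

        # Eq. (14): the difference used by the TOW learning rule, with K = K0.
        tow_diff = (sum(R_a) - sum(R_b)) - K0 * (N_a - N_b)

        # Eqs. (12), (13), (15): the imaginary system that updates both estimates, rescaled by 1/2.
        Qp_a = sum(R_a) - sum(R_b) + gamma * N_b
        Qp_b = sum(R_b) - sum(R_a) + gamma * N_a
        imaginary_diff = (Qp_a - Qp_b) / 2.0

        print(tow_diff, imaginary_diff)   # identical up to floating-point rounding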

    Monte Carlo simulations were performed, and it was verified that the ASDM with K0 exhibits an exceptionally high performance, which is comparable to its peak performance achieved with the optimal parameter Kopt. To derive the optimal value Kopt accurately, we need to take the fluctuations into account.

    In addition, the essence of the process described here can be generalized to M-machine cases. To separate distributions of the top m-th and top (m + 1)-th machine, as shown in Fig.3(b), all we need is the following K0:

    $K_0 = \frac{\gamma}{2}$,  (16)
    $\gamma = \mu_{(m)} + \mu_{(m+1)}$.  (17)
    Here, μ(m) denotes the top m-th mean, and m is any integer from 1 to M - 1. The MAB considered above is the special case where m = 1. In fact, for M-machine and X-player cases, we have designed a physical system that can determine the overall optimal state, called the “social maximum” [31,32], quickly and accurately [33,34].

    3.3. Performance characteristics

    In this section, we calculate a performance measure, called the “regret,” to characterize the high performance of the ASDM. We consider the “cheater algorithm,” an imaginary model for solving the MAB, because the regret of this algorithm can be easily calculated.

    The cheater algorithm selects a machine to play according to the following estimates Sk (k ∈ {A, B}):

    $S_A = X_{A,1} + X_{A,2} + \cdots + X_{A,N}$,  (18)
    $S_B = X_{B,1} + X_{B,2} + \cdots + X_{B,N}$.  (19)
    Here, Xk,i is a random variable. If SA > SB at time t = N, machine A is played at time t = N + 1. If SB > SA at time t = N, machine B is played at time t = N + 1. If SA = SB at time t = N, a machine is played randomly at time t = N + 1. Note that the algorithm refers to results of both machines at time t without any attention to which machine was played at time t - 1. In other words, the algorithm “cheats” because it plays both machines and collects both results, but declares that it plays only one machine at a time.

    The expected value and the variance of Xk are defined as E(Xk) = μk and V(Xk) = σk². Here, μk is the mean reward of machine k defined earlier. From the central-limit theorem, Sk has a Gaussian distribution with E(Sk) = μk N and V(Sk) = σk² N. If we define a new variable S = SA - SB, S has a Gaussian distribution and carries the following values:

    $E(S) = (\mu_A - \mu_B)\,N$,  (20)
    $V(S) = (\sigma_A^2 + \sigma_B^2)\,N$,  (21)
    $\sigma(S) = \sqrt{\sigma_A^2 + \sigma_B^2}\,\sqrt{N}$.  (22)

    From Fig. 4, the probability of playing machine B, which has the lower mean reward, can be described as Q(E(S)/σ(S)). Here, Q(x) is the Q-function, i.e., the tail probability of the standard normal distribution. We obtain
    $P(t = N+1,\, B) = Q(\phi\sqrt{N})$.  (23)
    Here,
    $\phi = \frac{\mu_A - \mu_B}{\sqrt{\sigma_A^2 + \sigma_B^2}}$.  (24)

    Figure 4. Q(E(S)/σ(S)): probability of selecting the lower-reward machine in the cheater algorithm.

    Using the Chernoff bound $Q(x) \le \frac{1}{2}\exp\!\left(-\frac{x^2}{2}\right)$, we can calculate the upper bound of a measure, called the “regret,” which quantifies the accumulated losses of the algorithm.

    $\mathrm{regret} = (\mu_A - \mu_B)\,E(N_B)$.  (25)
    $E(N_B) = \sum_{t=0}^{N-1} Q(\phi\sqrt{t}) \le \sum_{t=0}^{N-1} \frac{1}{2}\exp\!\left(-\frac{\phi^2}{2}t\right) = \frac{1}{2} + \sum_{t=1}^{N-1} \frac{1}{2}\exp\!\left(-\frac{\phi^2}{2}t\right) \le \frac{1}{2} + \int_{0}^{N-1} \frac{1}{2}\exp\!\left(-\frac{\phi^2}{2}t\right)\,dt = \frac{1}{2} - \frac{1}{\phi^2}\left(\exp\!\left(-\frac{\phi^2}{2}(N-1)\right) - 1\right)$  (26)
    $\le \frac{1}{2} + \frac{1}{\phi^2}$.  (27)
    Note that the regret becomes constant as N increases.
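    The bound can be checked with a few lines of code that sum the exact Q-function terms of Eq. (26) and compare them with the closed form of Eq. (27); the Q-function is written via the complementary error function. The parameter values follow those used in Sec. 4 and are otherwise arbitrary.

        import math

        def Q(x):
            """Gaussian tail probability Q(x) = 0.5 * erfc(x / sqrt(2))."""
            return 0.5 * math.erfc(x / math.sqrt(2.0))

        mu_a, mu_b, sigma = 0.6, 0.5, 0.2                        # values used in Sec. 4
        phi = (mu_a - mu_b) / math.sqrt(sigma**2 + sigma**2)     # Eq. (24)

        N = 1000
        expected_wrong_plays = sum(Q(phi * math.sqrt(t)) for t in range(N))    # Eq. (26)
        upper_bound = 0.5 + 1.0 / phi**2                                       # Eq. (27)

        print(expected_wrong_plays, upper_bound)
        # The sum saturates at a value below the bound, so the regret
        # (mu_a - mu_b) * E(N_B) stays constant as N grows (Eq. (25)).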

    Using the “cheated” results, we can also calculate the regret for the ASDM in the same way. In this case,

    $S_A = X_{A,1} + X_{A,2} + \cdots + X_{A,N_A} - K N_A$,  (28)
    $S_B = X_{B,1} + X_{B,2} + \cdots + X_{B,N_B} - K N_B$.  (29)
    Xk,i is also a random variable. Then, we obtain
    $E(S_k) = (\mu_k - K)\,N_k$,  (30)
    $V(S_k) = \sigma_k^2\,N_k$.  (31)
    Using the new variables S = SA - SB, N = NA + NB, and D = NA - NB, we also obtain
    $E(S) = \frac{\mu_A - \mu_B}{2}\,N + \left(\frac{\mu_A + \mu_B}{2} - K\right) D$,  (32)
    $V(S) = \frac{\sigma_A^2 + \sigma_B^2}{2}\,N + \frac{\sigma_A^2 - \sigma_B^2}{2}\,D$.  (33)

    If the conditions K = K0 and σA = σB ≡ σ are satisfied, we then obtain

    $E(S) = \frac{\mu_A - \mu_B}{2}\,N$,  (34)
    $V(S) = \sigma^2 N$,  (35)
    and
    $P(t = N+1,\, B) = Q(\phi_T\sqrt{N})$.  (36)
    Here,
    $\phi_T = \frac{\mu_A - \mu_B}{2\sigma}$.  (37)

    We can then calculate the upper bound of the regret for the ASDM

    $E(N_B) = \sum_{t=0}^{N-1} Q(\phi_T\sqrt{t}) \le \frac{1}{2} - \frac{1}{\phi_T^2}\left(\exp\!\left(-\frac{\phi_T^2}{2}(N-1)\right) - 1\right)$  (38)
    $\le \frac{1}{2} + \frac{1}{\phi_T^2}$.  (39)
    Note that the regret for the ASDM also becomes constant as N increases.
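    For illustration, with the parameter values used in Sec. 4 (μA = 0.6, μB = 0.5, σ = 0.2), Eqs. (37) and (39) give

    $\phi_T = \frac{0.6 - 0.5}{2 \times 0.2} = 0.25, \qquad E(N_B) \le \frac{1}{2} + \frac{1}{0.25^2} = 16.5,$

    so, by Eq. (25), the regret of the ASDM is bounded by (0.6 - 0.5) × 16.5 = 1.65, independent of the number of plays N.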

    It is well known that optimal algorithms for the MAB, defined by Auer et al. [16], have a regret proportional to log(N). The regret for the optimal algorithms has no finite upper bound as N increases, because they continue to require playing the lower-reward machine so as to ensure that the probability of incorrect judgment goes to zero. A constant regret for the ASDM means that the probability of incorrect judgment remains non-zero, although this probability is nearly equal to zero. However, it would appear that the reward probabilities change frequently in actual decision-making situations, and their long-term behavior is not important for many practical purposes. For this reason, the ASDM would be more suited to real-world applications.

    4. Simulation Results

    In this section, to show the effectiveness of the ASDM, we compare the performance of the ASDM with that of the SOFTMAX algorithm, a well-known algorithm for efficient decision-making [24] (see Appendix). From computer simulations, we confirmed that, in almost all cases, an ASDM with the parameter K0 (= (μA + μB)/2) can acquire more rewards than a SOFTMAX algorithm with the optimized parameter τopt, although SOFTMAX is well known as a good algorithm [35]. Here, the parameter K0 is nearly optimal, as shown in Fig. 5(a). Regret, as defined in the previous section, is a performance measure where a lower value indicates higher performance (more rewards). Figure 5(b) shows an ASDM/SOFTMAX performance comparison. The vertical axis denotes the regret (mean values of 1000 samples), and the horizontal axis denotes the number of plays. The blue dotted line denotes the upper bound of the ASDM with K0 (Eq. (39)). For the reward PDFs, we used normal distributions N(μA, σ²) and N(μB, σ²), where μA = 0.6, μB = 0.5, and σ = 0.2. Computer simulations were executed under the conditions Th = X0 and δ(t) = sin(π/2 + πt).
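    A minimal Monte-Carlo driver along the lines of this comparison is sketched below. It pits the TOW learning rule of Sec. 2 against an annealed SOFTMAX player under the reward PDFs and parameter values quoted above; the function names, the regret bookkeeping, and the specific annealing schedule β(t) = τt are assumptions made for this sketch rather than details taken from the paper.

        import math
        import random

        MU_A, MU_B, SIGMA = 0.6, 0.5, 0.2
        K0, TAU, N_PLAYS, SAMPLES = 0.55, 0.3, 1000, 200

        def pull(arm):
            """Reward PDFs N(mu_A, sigma^2) and N(mu_B, sigma^2), as in this section."""
            return random.gauss(MU_A if arm == 0 else MU_B, SIGMA)

        def asdm_run():
            """TOW learning rule (Eqs. (1)-(4)) with K = K0 and Th = X0."""
            q_a = q_b = 0.0
            wrong = 0
            for t in range(N_PLAYS):
                x = q_a - q_b + math.sin(math.pi / 2 + math.pi * t)
                if x >= 0:
                    q_a += pull(0) - K0
                else:
                    q_b += pull(1) - K0
                    wrong += 1
            return (MU_A - MU_B) * wrong              # regret, Eq. (25)

        def softmax_run():
            """Annealed SOFTMAX (see Appendix) with beta(t) = tau * t."""
            sums, counts = [0.0, 0.0], [0, 0]
            wrong = 0
            for t in range(1, N_PLAYS + 1):
                q = [sums[i] / counts[i] if counts[i] else 0.0 for i in range(2)]
                z = TAU * t * (q[1] - q[0])           # beta * (Q_B - Q_A)
                p_a = 1.0 / (1.0 + math.exp(min(50.0, max(-50.0, z))))
                arm = 0 if random.random() < p_a else 1
                sums[arm] += pull(arm)
                counts[arm] += 1
                if arm == 1:
                    wrong += 1
            return (MU_A - MU_B) * wrong

        print("ASDM   :", sum(asdm_run() for _ in range(SAMPLES)) / SAMPLES)
        print("SOFTMAX:", sum(softmax_run() for _ in range(SAMPLES)) / SAMPLES)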

    Figure 5. (a) The regret of the ASDM as a function of the parameter K at 1,000 plays (black solid line). The red dashed line denotes the regret of SOFTMAX with τopt = 0.3. (b) Performance comparison between the ASDM (black solid line) and SOFTMAX [24] (red dashed line). We used K0 = 0.55 for the ASDM and τopt = 0.3 for SOFTMAX. The blue dotted line denotes the upper bound of the ASDM with K0 (Eq. (39)).

    5. Discussion and Conclusion

    In this study, we proposed an ASDM for solving the MAB, and analytically validated its high efficiency in making decisions. In conventional decision-making algorithms for solving MABs, the parameter for adjusting the “exploration time” must be optimized. This exploration parameter always reflects the difference between the rewarded experiences, i.e., |μA - μB|. In contrast, the ASDM demonstrates that higher performance can be achieved by introducing a parameter K0 that refers to the sum of the rewarded experiences, i.e., μA + μB. This type of optimization, using the sum of the rewarded experiences, is particularly useful for time-varying environments (reward probability or reward PDF) [26]. Owing to this novelty, the high performance of the TOW dynamics can be reproduced when implementing these dynamics with atomic switches.

    The ASDM proposed in this paper is a simple “ideal model.” Although the assumptions used to construct the model may contain some points that do not match real experimental situations, the model can be extended so that the modified assumptions do match them. As the model is extended to treat more than two options, some experimental limitations may be found in implementing TOW dynamics with more than two atomic switches. Nevertheless, as long as the TOW dynamics between atomic switches is implemented, high-performance decision-making can be guaranteed even in the extended model.

    The ASDM will introduce a new physics-based analog-computing paradigm, which will include such things as “intelligent nanodevices” based on self-judgment. Thus, our proposed physics-based analog-computing paradigm would be useful for a variety of real-world applications and for understanding the biological information-processing principles that exploit their underlying physics.

    Appendix

    SOFTMAX algorithm

    The SOFTMAX algorithm is a well-known algorithm for solving MABs [24]. In this algorithm, the probability of selecting A or B, PA(t) or PB(t), is given by the following Boltzmann distributions:

    $P_A(t) = \frac{\exp[\beta Q_A(t)]}{\exp[\beta Q_A(t)] + \exp[\beta Q_B(t)]}$,  (40)
    $P_B(t) = \frac{\exp[\beta Q_B(t)]}{\exp[\beta Q_A(t)] + \exp[\beta Q_B(t)]}$,  (41)
    where Qk(t) (k ∈ {A, B}) is given by $\frac{1}{N_k(t)}\sum_{j=1}^{N_k(t)} R_k(j)$. Here, β takes a time-dependent form in our study, as follows:
    $\beta(t) = \tau\,t$.  (42)
    β = 0 corresponds to a random selection, and β → ∞ corresponds to a greedy action.
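    The effect of the annealing is easy to see numerically: with a fixed pair of estimates, the selection probabilities of Eqs. (40) and (41) move from nearly random toward nearly greedy as β(t) grows. The estimate values and the schedule below are illustrative assumptions.

        import math

        def softmax_probs(q_a, q_b, beta):
            """Boltzmann selection probabilities of Eqs. (40) and (41)."""
            m = max(q_a, q_b)                    # subtract the maximum for numerical stability
            w_a = math.exp(beta * (q_a - m))
            w_b = math.exp(beta * (q_b - m))
            return w_a / (w_a + w_b), w_b / (w_a + w_b)

        # With tau = 0.3, beta(t) = tau * t grows with time, so the selection moves
        # from nearly random (small t) toward nearly greedy (large t).
        for t in (1, 10, 100):
            print(t, softmax_probs(q_a=0.6, q_b=0.5, beta=0.3 * t))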

    Acknowledgments

    The authors thank Dr. Etsushi Nameda and Prof. Masahiko Hara for valuable discussions in the early stages of this work. This work was supported in part by the Sekisui Chemical Grant Program for “Research on Manufacturing Based on Innovations Inspired by Nature.”

    Conflict of Interest

    The authors declare that there is no conflict of interest regarding the publication of this paper.

    [1] Castro LN (2007) Fundamentals of natural computing: an overview. Physics of Life Reviews 4: 1-36. doi: 10.1016/j.plrev.2006.10.002
    [2] Kari L, Rozenberg G (2008) The many facets of natural computing. Communications of the ACM 51: 72-83.
    [3] Rozenberg G, Back T, Kok J (2012) Handbook of natural computing, Springer-Verlag.
    [4] Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by simulated annealing. Science 220: 671-680. doi: 10.1126/science.220.4598.671
    [5] Hopfield JJ, Tank DW (1985) Neural computation of decisions in optimization problems. Biological Cybernetics 52: 141-152.
    [6] Brady RM (1985) Optimization strategies gleaned from biological evolution. Nature 317: 804-806. doi: 10.1038/317804a0
    [7] Adleman LM (1994) Molecular computation of solutions to combinatorial problems. Science 266: 1021-1024. doi: 10.1126/science.7973651
    [8] Terabe K, Hasegawa T, Nakayama T, et al. (2001) Quantum point contact switch realized by solid electrochemical reaction. RIKEN Review 37: 7-8.
    [9] Terabe K, Hasegawa T, Nakayama T, et al. (2005) Quantized conductance atomic switch. Nature 433: 47-50. doi: 10.1038/nature03190
    [10] Kim S-J, Aono M, Nameda E (2015) Efficient decision-making by volume-conserving physical object. New J Phys 17: 083023. doi: 10.1088/1367-2630/17/8/083023
    [11] Robbins H (1952) Some aspects of the sequential design of experiments. Bull Amer Math Soc 58: 527-536. doi: 10.1090/S0002-9904-1952-09620-8
    [12] Thompson W (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25: 285-294. doi: 10.1093/biomet/25.3-4.285
    [13] Gittins J, Jones D (1974) Dynamic allocation index for the sequential design of experiments, In: Gans, J. Progress in Statistics North Holland, 241-266.
    [14] Gittins J (1979) Bandit processes and dynamic allocation indices. J R Stat Soc B 41: 148-177.
    [15] Agrawal R (1995) Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Adv Appl Prob 27: 1054-1078. doi: 10.2307/1427934
    [16] Auer P, Cesa-Bianchi N, Fischer P (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learning 47: 235-256. doi: 10.1023/A:1013689704352
    [17] Lai L, Jiang H, Poor HV (2008) Medium access in cognitive radio networks: a competitive multiarmed bandit framework. Proc. of IEEE 42nd Asilomar Conference on Signals, System and Computers, 98-102.
    [18] Lai L, Gamal HE, Jiang H, et al. (2011) Cognitive medium access: exploration, exploitation, and competition. IEEE Trans. on Mobile Computing 10: 239-253. doi: 10.1109/TMC.2010.65
    [19] Agarwal D, Chen BC, Elango P (2009) Explore/exploit schemes for web content optimization. Proc of ICDM2009, http://dx.doi.org/10.1109/ICDM.2009.52.
    [20] Kocsis L, Szepesvári C (2006) Bandit based Monte-Carlo planning, In: Carbonell, J. G. et al., 17th European Conference on Machine Learning, Lecture Notes in Artificial Intelligence 4212, Springer, 282-293.
    [21] Gelly S, Wang Y, Munos R, et al. (2006) Modification of UCT with patterns in Monte-Carlo Go. RR-6062-INRIA, 1-19.
    [22] Demis EC, Aguilera R, Sillin HO, et al. (2015) Atomic switch networks - nanoarchitectonic design of a complex system for natural computing. Nanotechnology 26: 204003. doi: 10.1088/0957-4484/26/20/204003
    [23] Avizienis AV, Sillin HO, Martin-Olmos C, et al. (2012) Neuromorphic atomic switch networks. PLoS ONE 7: e42772. doi: 10.1371/journal.pone.0042772
    [24] Sutton R, Barto A (1998) Reinforcement Learning: An Introduction, MIT Press.
    [25] Kim S-J, Aono M, Hara M (2010) Tug-of-war model for multi-armed bandit problem, In: Calude C. et al. Unconventional Computation, Lecture Notes in Computer Science 6079, Springer, 69-80.
    [26] Kim S-J, Aono M, Hara M (2010) Tug-of-war model for the two-bandit problem: Nonlocally-correlated parallel exploration via resource conservation. BioSystems 101: 29-36. doi: 10.1016/j.biosystems.2010.04.002
    [27] Kim S-J, Naruse M, Aono M, et al. (2013) Decision maker based on nanoscale photo-excitation transfer. Sci Rep 3: 2370. doi: 10.1038/srep02370
    [28] Naruse M, Nomura W, Aono M, et al. (2014) Decision making based on optical excitation transfer via near-field interactions between quantum dots. J Appl Phys 116: 154303. doi: 10.1063/1.4898570
    [29] Naruse M, Berthel M, Drezet A, et al. (2015) Single-photon decision maker. Sci Rep 5: 13253. doi: 10.1038/srep13253
    [30] Tsuruoka T, Hasegawa T, Terabe K, et al. (2012) Conductance quantization and synaptic behavior in a Ta2O5-based atomic switch. Nanotechnology 23: 435705. doi: 10.1088/0957-4484/23/43/435705
    [31] Roughgarden T (2005) Selfish routing and the price of anarchy, MIT Press, Cambridge.
    [32] Nisan N, Roughgarden T, Tardos E, et al. (2007) Algorithmic Game Theory, Cambridge Univ. Press.
    [33] Kim S-J, Aono M (2015) Decision maker using coupled incompressible-fluid cylinders. Special issue of Advances in Science, Technology and Environmentology B11: 41-45, Available from: http://arxiv.org/abs/1502.03890.
    [34] Kim S-J, Naruse M, Aono M (2015) Harnessing natural fluctuations: analogue computer for efficient socially-maximal decision-making. eprint arXiv, Available from: http://arxiv.org/abs/1504.03451.
    [35] Vermorel J, Mohri M (2005) Multi-armed bandit algorithms and empirical evaluation, In: Gama J., et al. 16th European Conference on Machine Learning. Lecture Notes in Artificial Intelligence 3720, Springer, 437-448.
  • © 2016 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)