Communication Special Issues

Fault-tolerant Hamiltonian cycle strategy for fast node fault diagnosis based on PMC in data center networks

  • System-level fault diagnosis model, namely, the PMC model, detects fault nodes only through the mutual testing of nodes in the system without physical equipment. In order to achieve server nodes fault diagnosis in large-scale data center networks (DCNs), the traditional algorithm based on the PMC model cannot meet the characteristics of high diagnosability, high accuracy and high efficiency due to its inability to ensure that the test nodes are fault-free. This paper first proposed a fault-tolerant Hamiltonian cycle fault diagnosis (FHFD) algorithm, which tests nodes in the order of the Hamiltonian cycle to ensure that the test nodes are faultless. In order to improve testing efficiency, a hierarchical diagnosis mechanism was further proposed, which recursively divides high scale structures into a large number of low scale structures based on the recursive structure characteristics of DCNs. Additionally, we proved that 2(n2)nk1 and (n2)tn,k/tn,1 faulty nodes could be detected for BCuben,k and DCelln,k within a limited time for the proposed diagnosis strategy. Simulation experiments have also shown that our proposed strategy has improved the diagnosability and test efficiency dramatically.

    Citation: Zhipeng Zhao, Zhenyu Hu, Zhiyu Zhao, Xiaoyu Du, Tianfei Chen, Lijun Sun. Fault-tolerant Hamiltonian cycle strategy for fast node fault diagnosis based on PMC in data center networks[J]. Mathematical Biosciences and Engineering, 2024, 21(2): 2121-2136. doi: 10.3934/mbe.2024093

    Related Papers:

    [1] Xueyan Wang . A fuzzy neural network-based automatic fault diagnosis method for permanent magnet synchronous generators. Mathematical Biosciences and Engineering, 2023, 20(5): 8933-8953. doi: 10.3934/mbe.2023392
    [2] Pu Yang, Zhenbo Li, Yaguang Yu, Jiahui Shi, Ming Sun . Studies on fault diagnosis of dissolved oxygen sensor based on GA-SVM. Mathematical Biosciences and Engineering, 2021, 18(1): 386-399. doi: 10.3934/mbe.2021021
    [3] Rong Sun, Yuntao Han, Yingying Wang . Design of generalized fault diagnosis observer and active adaptive fault tolerant controller for aircraft control system. Mathematical Biosciences and Engineering, 2022, 19(6): 5591-5609. doi: 10.3934/mbe.2022262
    [4] Jinyi Tai, Chang Liu, Xing Wu, Jianwei Yang . Bearing fault diagnosis based on wavelet sparse convolutional network and acoustic emission compression signals. Mathematical Biosciences and Engineering, 2022, 19(8): 8057-8080. doi: 10.3934/mbe.2022377
    [5] Shuchun Yu, Jinjian Tao, Jun Liu, Yanshu Miao . Research on fault diagnosis technology of heat meter based on multi classifier fusion of pigeon swarm algorithm. Mathematical Biosciences and Engineering, 2023, 20(4): 6312-6326. doi: 10.3934/mbe.2023272
    [6] Jing Yang, Guo Xie, Yanxi Yang, Qijun Li, Cheng Yang . A multilevel recovery diagnosis model for rolling bearing faults from imbalanced and partially missing monitoring data. Mathematical Biosciences and Engineering, 2023, 20(3): 5223-5242. doi: 10.3934/mbe.2023242
    [7] Yan Yan, Yong Qian, Hongzhong Ma, Changwu Hu . Research on imbalanced data fault diagnosis of on-load tap changers based on IGWO-WELM. Mathematical Biosciences and Engineering, 2023, 20(3): 4877-4895. doi: 10.3934/mbe.2023226
    [8] Cong Wang, Chang Liu, Mengliang Liao, Qi Yang . An enhanced diagnosis method for weak fault features of bearing acoustic emission signal based on compressed sensing. Mathematical Biosciences and Engineering, 2021, 18(2): 1670-1688. doi: 10.3934/mbe.2021086
    [9] Qiushi Wang, Zhicheng Sun, Yueming Zhu, Chunhe Song, Dong Li . Intelligent fault diagnosis algorithm of rolling bearing based on optimization algorithm fusion convolutional neural network. Mathematical Biosciences and Engineering, 2023, 20(11): 19963-19982. doi: 10.3934/mbe.2023884
    [10] Zhenzhong Chu, Zhenhao Gu, Zhiqiang Li, Yunsai Chen, Mingjun Zhang . A fault diagnostic approach based on PSO-HMM for underwater thrusters. Mathematical Biosciences and Engineering, 2022, 19(12): 12617-12631. doi: 10.3934/mbe.2022589
  • System-level fault diagnosis model, namely, the PMC model, detects fault nodes only through the mutual testing of nodes in the system without physical equipment. In order to achieve server nodes fault diagnosis in large-scale data center networks (DCNs), the traditional algorithm based on the PMC model cannot meet the characteristics of high diagnosability, high accuracy and high efficiency due to its inability to ensure that the test nodes are fault-free. This paper first proposed a fault-tolerant Hamiltonian cycle fault diagnosis (FHFD) algorithm, which tests nodes in the order of the Hamiltonian cycle to ensure that the test nodes are faultless. In order to improve testing efficiency, a hierarchical diagnosis mechanism was further proposed, which recursively divides high scale structures into a large number of low scale structures based on the recursive structure characteristics of DCNs. Additionally, we proved that 2(n2)nk1 and (n2)tn,k/tn,1 faulty nodes could be detected for BCuben,k and DCelln,k within a limited time for the proposed diagnosis strategy. Simulation experiments have also shown that our proposed strategy has improved the diagnosability and test efficiency dramatically.



    With the continuous expansion of the scale of the data center networks (DCNs), the number of servers in the network increases exponentially [1]. The server plays an important role in DCNs, which not only is used to process data but is also needed to forward data. In addition, the probability of server nodes fault is very high and the server failures cause data loss and abnormal data forwarding. Therefore, servers fault diagnosis becomes an inevitable measure to ensure a DCN reliable communication [2].

    Preparata et al. [3] proposed the first system-level fault diagnosis model, namely, the PMC model, which was used to solve the problem of automatic fault diagnosis of the multiprocessor system. Every node in the system is capable of performing tests on its adjacent node. The PMC model assumes that the tests performed by fault-free nodes are always correct, whereas tests performed by faulty nodes are unreliable. Generally, a PMC model is divided into two steps. First, adjacent nodes in the system produce test results by testing each other, which is called syndrome. Second, syndrome be analyzed to find out the faulty nodes. Typically, PMC models focus on diagnostic strategies for the second stage syndrome. Diagnosis strategy contains precise diagnosis [3], pessimistic diagnostics [4] and t/k diagnostics [5] etc. If all fault-free nodes are not mistaken for faulty nodes, it is called precise diagnosis [6]; if there are fault-free nodes that are mistaken for faulty nodes, it is called pessimistic diagnosis [7]. t/k diagnostics is that k fault-free nodes may be mistaken for faulty nodes, so precise and pessimistic diagnosis are special cases of t/k diagnosis [8]. Specifically, t/k diagnosis is precise diagnosis when k = 0, and t/k diagnosis is pessimistic diagnosis when k=1. Many diagnosis algorithms were proposed using precise, pessimistic or t/k diagnosis strategy [9,10].

    In the past, system-level fault diagnosis was commonly used in small multiprocessor systems. Nowadays, system-level fault diagnosis is more studied in DCNs with the development of DCNs. For example, Li et al. [11] studied the diagnosability of precise diagnosis and pessimistic diagnosis of DCelln,k and studied the t/k diagnosability in literature [12]. The conclusions are that the precise diagnosability of DCelln,k is n+k1, the pessimistic diagnosability is 2k+n2 when n2 and k2 and the t/k diagnosability is (k+1)(m1)+n when 1mn1. Huang H [13] studied the diagnosability of precise diagnosis of BCuben,k. The conclusion is that the precise diagnosability of BCuben,k is (n1)(k+1)1 when n2 and k0. However, they are unable to deal with large numbers of fault nodes in DCN due to their limited diagnosability. For example, DCell3,3 contains 24,492 servers with precise diagnosability of five, pessimistic diagnosability of seven and t/k diagnosability of nine. Obviously, there may be more than nine fault nodes in this network.

    To improve diagnosability, Heng et al. [14] proposed a probabilistic diagnosis method. Because it is unreliable for two unknown state nodes to test each other, the more times they tested, the more accurate the test results will be, and finally the states of the two nodes can be obtained. However, multiple tests can cause the low diagnostic efficiency and occupy the large network bandwidth, so it is not suitable for DCN networks with a large number of servers. Li et al. [15] proposed an algorithm with time complexity O(N) for hypercube-like networks by using the Hamiltonian hypercube network and gemini diagnosis structure, which greatly improves efficiency of the algorithm. Ye et al. [16] put forward five-round adaptive diagnosis in Hamiltonian networks, which greatly improves diagnosability.

    However, traditional algorithms based on the PMC model have two stages of system-level diagnosis. The first stage is to test each other between adjacent nodes, in which there may be fault nodes. The test results of fault nodes are uncertain, so it is impossible to get the correct status of all nodes through the syndrome, and it is necessary to apply the diagnosis strategy (precise or pessimistic diagnosis) to the syndrome in the second step to determine the fault node. If the test node is fault-free, then the status of the tested node can be obtained, so there is no need for a second step. This paper first proposes a fault-tolerant Hamiltonian cycle fault diagnosis algorithm (FHFD), which tests nodes in the order of the Hamiltonian cycle to ensure that the test nodes are fault-free and then combines with probability diagnosis methods to improve the diagnosability [17]. In order to improve testing efficiency, a hierarchical diagnosis strategy is also proposed, which recursively divides high scale structures into a large number of low scale structures based on the recursive structure characteristics of DCN. Concretely, we make three main contributions in the strategy.

    (1) Compared to traditional diagnosis strategies, the key difference is that our proposed strategy is more suitable for DCNs with multiple servers. This strategy greatly improves the diagnosability. 2(n2)nk1 and (n2)tn,k/tn,1 fault nodes can be accurately detected for BCuben,k and DCelln,k at most (when n3,k>0).

    (2) There is a misdiagnosis node in pessimistic diagnosis based on the traditional PMC model. The strategy we proposed ensures that the test node is fault-free by the fault-tolerant Hamiltonian cycle so that there is no misdiagnosis node.

    (3) A hierarchical diagnosis mechanism is further proposed to improve testing efficiency, which recursively divides high scale structures into a large number of low scale structures based on the recursive structure characteristics of DCNs.

    The rest of the paper is organized as follows. Preliminaries are introduced in Section 2 and diagnosis strategy based on DCNs is described in Section 3. Performance of the proposed algorithms are shown in Section 4. Finally, we conclude this paper in Section 5.

    In Section 2.1, we will present some notations and terminologies used in this paper. Then, in Section 2.2, we will describe the definition of DCell and BCube structures and some properties of Hamiltonian. Finally, in Section 2.3, we will introduce the PMC model and probabilistic diagnosis method for diagnosis.

    The topology of DCNs can be represented by an undirected graph G=(V(G),E(G)), in which V(G) is the set of vertices and E(G)={u,v|u,vV} represents the set of edges. Vertices and edges represent servers and communication links in DCNs, respectively. For an undirected graph G=(V(G),E(G)), |V| represents the number of servers in G. The edge between vertices vi and vj is denoted by (vi,vj). The neighbor set of a vertex x in G is defined as NG(x)={yV|(x,y)E}. LetLV, GL be denoted as a subgraph with V(GL)=VL,E(GL)={(x,y)E|x,y(VL)}. Path P(v0,vt)=(v0,v1,...,vt) is a sequence of different vertices (except v0 and vt) from v0 to vt, and any two consecutive vertices are adjacent. Below are the following definitions of the Hamiltonian concept:

    Hamiltonian Path: Given graph G, Vi,VjV, if P is a path from Vi to Vj that passes all vertices once, and only once, in G, then P is called a Hamiltonian path from Vi to Vj in G.

    Hamiltonian Cycle: Given graph G, Vi,VjV, starting from Vi, if P is a path from Vi to Vj that passes all vertices once, and only once, in G and finally returns to Vi, then P is called a Hamiltonian cycle from Vi to Vj in G.

    Hamiltonian connected: Given graph G, if there exists a Hamiltonian path between any distinct vertices in G, then the graph G is called Hamiltonian connected or G is a Hamiltonian connected graph.

    F(G) is used to represent the set of fault elements in graph G(V,E) (in this paper, only the set of fault servers), where F(G)V(G). Let f(G)=|F(G)| represent the number of fault servers, and if f(G)=0, then G has no faulty servers.

    Definition 1. Fk-fault-tolerant Hamiltonian graph: If GF(G) is a Hamiltonian graph, then G is an Fk-fault-tolerant Hamiltonian graph where Fk=f(G).

    Hamiltonian cycle is denoted by H(Vh,Eh) in graph G, while G(V,E) is an Fk-fault-tolerant Hamiltonian graph, where Vh=V, EhE, xiVh(1i|V|), then the Hamiltonian cycle path is H<xi1,xi2,...,xi|v|,xi1>, where <i1,i2,...,i|V|> is the sequence combination of [1,...,|V|].

    X(n,k) or Xn,k denotes a DCN with fault-tolerant Hamiltonian cycle and recursive structure, where k represents the hierarchy of structure, n represents the number of servers in X(n,0) and tn,k represents the number of servers in X(n,k).

    DCell and BCube structures exist fault-tolerant Hamiltonian cycle and are also recursive network structures. Next, the recursive construction rules of DCell and BCube and its Hamiltonian properties are introduced, which prepares for the diagnostic strategy proposed in this article.

    Definition 2 [18]. The recursive definition of DCelln,k is as follows:

    (1) DCelln,0 is a complete graph with n vertices.

    (2) When k1, DCelln,k is composed of (tn,k1+1)DCelln,k1. The (i+1)th DCelln,k1 is represented by DCellin,k1, where 0i<tn,k1+1.

    In DCelln,k, the address of the server is represented by akak1...a0(a0[0,n1],ap[0,tp1,n]p[1,k]). According to the coding rules of servers in literature [18], DCellin,k1 contains the address of the server, which is as follows:

    DCellin,k1={akak1...a0|i[0,(tn,k1+1)],ak=i%(tn,k1+1),a0[0,n1],ap[0,tp1,n],p[1,k1]}.  (2.1)

    Wang X [19] studied the Hamiltonian property of DCell and the conclusions are as follows:

    Theorem 1. When n2,k2, DCelln,k (except DCell2,1) is Hamiltonian connected and is a (n+k-3)-fault-tolerant Hamiltonian graph.

    Definition 3 [20]. The recursive definition of BCuben,k is as follows:

    (1) BCuben,0 is a complete graph with n vertices.

    (2) When k1, BCuben,k is composed of nBCuben,k1. The (i+1)th BCuben,k1 is represented by BCubein,k1, where 0i<n.

    In BCuben,k, the address of the server is represented by akak1...a0(a0[ap[0,n1],p[0,k]). According to the coding rules of servers in literature [20], BCubein,k1 contains the address of the server, which is as follows:

    DCubein,k1={akak1...a0|i[0,n],ak=i%n,ap[0,n],p[0,k1]}  (2.2)

    Huang et al. [21] studied the Hamiltonian connection of BCube and Wang et al. [22] studied the fault-tolerant Hamiltonian property of BCube, and their conclusions are as follows:

    Theorem 2. When n3,k0, BCuben,k is Hamiltonian connected, and when n4,k0, BCuben,k is a [(n1)(k+1)2]-fault-tolerant Hamiltonian graph.

    In undirected graph G=(V,E), for any two adjacent nodes (vi,vj), the notation σij is used to represent the result of vi test vj. σij = 0 represents test result as fault-free. On the contrary, σij = 1 represents test result as faulty.

    When vi is fault-free: If vj is fault-free, then σij=0; if vj is faulty, then σij=1.

    When vi is faulty: Whether vj is fault-free or faulty, its test result may be σij=0 or σij=1, and assume that the probability of σij=0 is p, where 0<p<1.

    All possible comparison results are shown in Table 1 for the PMC model.

    Table 1.  PMC model.
    vi vj σi,j
    Fault-free Fault-free 0
    Fault-free Faulty 1
    Faulty Fault-free o or 1
    Faulty Faulty 0 or 1

     | Show Table
    DownLoad: CSV

    In the PMC model, if the result of the test is σij = 0, from Table 1, we can get the corresponding three situations:

    1) Both vi and vj are fault-free;

    2) vi is faulty, but vj is fault-free;

    3) Both vi and vj are faulty.

    We cannot get the precise results though just one test; therefore, we test testing many times between two nodes before they are set to get their state and the probabilistic diagnosis method can be designed.

    Theorem 3. Four diagnosis results would be obtained through the responses of tests executed by each other r times by a pair of adjacent nodes vi and vj (r is large enough):

    (1) If r1σij=r1σji=0, then vi and vj are fault-free;

    (2) If r1σij=r&0<r1σji<r, then vi is fault-free and vj is faulty;

    (3) If 0<r1σij<r&r1σji=r, then vi is faulty and vj is fault-free;

    (4) If 0<r1σij<r&0<r1σji<r, then vi and vj are faulty;

    Proof:

    (1) If vi is faulty, the probability of r1σij=0 can be calculated by binomial distribution: P{X=k}=Crkpk(1p)rk=pk.

    Assuming that the probability of σij=0 is p = 0.5, test times r = 9 and P{X=k}=0.59=0.0019. Since the probability is too small, it can be considered that vi=0 and vj=0.

    (2) If vi = 1, the probability of r1σij=r can be calculated by binomial distribution: P{X=k}=Crkpk(1p)rk=(1p)r.

    Assuming that the probability of σij=0 is p = 0.5, test times r = 9 and P{X=k}=0.59=0.0019. Since the probability is too small, it can be considered that vi1 and vj=0. Since 0<r1σji<r and vi=0, according to the PMC rule, vj=1. Through the same logic, it can be proved that (3) holds.

    (4) Since 0<r1σij<r&0<r1σji<r, it means σji=0 or σji=1, and σji=0 or σji=1. According to the PMC rule, vi=1 and vj=1.

    We propose a novel fault diagnosis strategy, which tests nodes in the order of the Hamiltonian cycle to ensure that test nodes are fault-free. Specifically, the strategy consists of two parts: FHFD algorithm and hierarchical diagnosis method, which is suitable for DCNs with the following conditions:

    (1) Topology G(V,E) of Xn,k is Hamiltonian connected and an Fk-fault-tolerant hamiltonian graph, where k>0;

    (2)m>2, when n>2, k>0, Xn,k consists of mXn,k1, as shown in equation (3.1):

    Xn,k=(m1)(i=0)Xin,k1. (3.1)

    Xin,k1 is the (i+1)th Xn,k1, where 0i<m and m has different values on different network structures. Both BCube and DCell construct rules are shown as in Table 2.

    Table 2.  BCube and DCell construct rules.
    Xn,0 Xn,1 ... Xn,k
    BCube BCuben,0 nBCuben,0 ... nBCuben,k1
    2 DCell (tn,0+1)DCelln,0 ... (tn,k1+1)DCelln,k1

     | Show Table
    DownLoad: CSV

    (3) The address of the server in Xn,k is denoted by akak1...a0(ap[0,m1],p[0,k]). Through (1) and (2), we can distinguish different Xn,k1 by ak, as shown in equation (3.2):

    Xin,k1={akak1...a0|i[0,m],ak=i%m,ap[0,m1],p[0,k]}. (3.2)

    There exists a Hamiltonian cycle H(Vh,Eh) while a graph G(V,E) is Fk-fault-tolerant Hamiltonian. Let the H(Vh,Eh) path be H<xi1,xi2,...,xi|v|,xi1>. We first use probabilistic diagnosis in Theorem 3 to find a fault-free node as xi1, and the state of xi2 will be determined accurately according to the testing of xi1 to xi2. If xi2 is fault-free, the test outcome is accurate while xi2 tests xi3; Therefore, we can get the accurate outcome of xi3 testing xi4, for xi3 is fault-free. In turn, all nodes could be tested until the last node. On the other hand, if xi2 is faulty, two cases are discussed:

    Algorithm 1: FHFD step
    1 Constructing the hamiltonian cycle H(Vh,Eh) of G(V,E).
    2 Let xi1 test xi2 using the probability diagnosis method in Theorem 3; if xi1 = 0, let a = xi1; if xi1 = 1, then reselect two adjacent nodes to test using probabilistic diagnostic methods until the correct node is found.
    3 b,(a,b)Eh, let a test b; if σab=0, let a=b. Repeat step 3 until all nodes are detected; if σab=1, b corresponding fault node is record F(G) and step 4 is executed.
    4 Variable i is used to record the number of fault nodes; if iFk, the new Hamiltonian cycle H(Vhb,Eh) is constructed and step 3 is executed; if i>Fk, step 5 is executed.
    5 Set the next node of b as a, and the next node of a as b. Let a test b with the probability test method, and there are four situations: a=b=1, the faulty nodes a and b are recorded to F(G), and step 5 is repeated until all nodes are detected; a=b=0, let a=b and step 3 is executed; a=0 and b=1, the fault node b is recorded to F(G), and step 5 is repeated until all nodes are detected; a=1 and b=0, the fault node a is recorded to F(G), and let a = b, then step 3 is executed.

    Case 1: f(G)Fk

    G(V,E) is Fk-fault-tolerant Hamiltonian, and deleting Fk faulty nodes can still form a new Hamiltonian cycle. If f(G)Fk, the fault node xi2 can be deleted, then a new Hamiltonian cycle H(Vhxi2,Eh) is generated. Let xi1 continue to test xi3 following Hamiltonian cycle.

    Case 2: f(G)>Fk

    If xi2 is deleted when f(G)>Fk, the remaining nodes will not be able to construct a new Hamiltonian cycle, so let xi3 test xi4 using the probability diagnosis method. Due to the need for repeated tests between two nodes, the test efficiency is low and the network bandwidth is greatly occupied.

    As shown in Figure 1, X(V,E) is 1-fault-tolerant Hamiltonian graph, and the generated Hamiltonian circle is represented by H(Vh,Eh). Assuming p = 0.5, the number of test r = 9 and 91(Xi1,Xi2)=91(Xi2,Xi1)=0 satisfies Case 1 in Theorem 3, then Xi1 and Xi2 are fault-free. Let Xi2 test Xi3 so that Xi3 is faulty node and a new Hamiltonian cycle H(VhXi3,Eh) is constructed, as shown in Figure 2. Let Xi2 test Xi4 so that Xi4 is fault-free, and Xi4 test Xi5 so that Xi5 is faulty. Since X(V,E) is 1-fault-tolerant Hamiltonian graph, deleting two nodes cannot construct a new Hamiltonian cycle and the remaining nodes can use the probability diagnosis method to detect the fault.

    Figure 1.  Topology of H(Vh,Eh).
    Figure 2.  Topology of H(VhXi3,Eh)).

    For Fk-fault-tolerant Hamiltonian graph G(V,E), the relationship between the number of fault nodes and the number of tests in diagnosis is as follows:

    N={|v|2+rf(G)Fk|v|+(r2)[f(G)Fk+1]Fk<f(G). (3.3)

    In equation (3.3), N is the total number of tests, |v| denotes the number of servers in G(V,E), f(G) denotes the number of fault nodes and r is the number of times that two nodes in the probability diagnosis method need to test each other.

    BCube4,4 is 13-fault-tolerant Hamiltonian graph by Theorem 2, where Fk = 13. |v| is the number of servers, where |v| = 1024. Supposing n = 15, the numbers of tests required for different number of faulty nodes are shown in Figure 3 from equation (3.3). DCell5,2 is 4-fault-tolerant Hamiltonian graph by Theorem 1, where Fk = 4 and |v| = 930. Supposing n = 15, the numbers of tests required for different numbers of faulty nodes are shown in Figure 4 from equation (3.3).

    Figure 3.  The number of BCube4,4.
    Figure 4.  The number of BCell5,2.

    As shown in Figure 3 and Figure 4, when f(G)>Fk, the number of tests increases substantially with the number of faulty nodes. On the basis of the previous analysis, we can get that it will take up a lot of bandwidth and more time to test faulty nodes while f(G)>Fk, and when the FHFD algorithms are applied to the node diagnosis of DCNs, there will be some problems as follows because of the large scale of its servers. First, a lot of time would be spent to build Hamiltonian cycles and fault tolerant Hamiltonian cycles. Second, the nodes should be tested in the order of Hamiltonian cycles, and a complete test for the DCNs with thousands of servers will also take a lot of time.

    The total diagnostic time of the FHFD algorithm includes two parts. One part is the test time between nodes. The other part is the time consumed by constructing Hamiltonian cycles and fault-tolerant Hamiltonian cycles. We use MATLAB to simulate the FHFD algorithm for different scale networks, and the running times of the algorithm are shown in Figure 5 and Figure 6.

    Figure 5.  The test time of different scale small network.
    Figure 6.  The test time of different scale big network.

    The experimental results show that:

    1) In Figure 5, when the number of network nodes is small, the test time between nodes is greater than the time used to construct the Hamiltonian cycle. With the increase in the number of nodes, the test time between nodes does not increase much and all the time for constructing Hamiltonian cycles increases sharply.

    2) In Figure 6, when the nodes in the network reach a certain scale, the construction of Hamiltonian cycles consumes a lot of time. For example, it takes 3,000 to construct Hamiltonian cycles for the network with 196 nodes.

    The FHFD algorithm in the test process by breadth-first search to construct the fault-tolerant Hamiltonian cycle is an non-deterministic polynomial (NP)-complete problem, so when the number of nodes increases to a certain value, the time of constructing the Hamiltonian cycle increases sharply. Obviously, such a long diagnosis time cannot meet the actual situation. In the next section, we propose a hierarchical diagnosis method to solve the above problems.

    By equation (3), we have that Xn,k can be divided into mXn,k1. Each Xn,k1 can be divided into mXn,k2, and the lowest can be divided into Xn,0. Therefore, there is the following conclusion:

    Xn,k can be divided into MXn,bs (0b<k), where M is a constant (different structures have different M values):

    We consider the following two cases according to b.

    Case 1: 0<b<k

    Xn,k can be divided into MXn,bs (0<b<k) and Xn,b is an Fk-fault-tolerant Hamiltonian graph. If MXn,b are simultaneously tested, then Xn,k equals to having M Fk-fault-tolerant values. The network structure of BCube3,2 is shown in Figure 7, which can be divided into 3 BCube3,1 for simultaneous testing. By equation (4), Xin,k1={akak1...a0|i[0,m],ak=i%m}; therefore, the nodes contained in BCube3,2 can be divided:

    BCube03,1={000,001,002,010,011,012,020,021,022}
    BCube13,1={100,101,102,110,111,112,120,121,122}
    BCube23,1={200,201,202,210,211,212,220,221,222}
    Figure 7.  The test time of different scale big network.

    By theorem 2, BCube3,2 is 4-fault-tolerant Hamiltonian graph, where fault-tolerant values Fk = 4. BCube3,1 is 2-fault-tolerant Hamiltonian graph, where degree of diagnosability Fk = 2. When the FHFD algorithm is applied to BCube03,1, BCube13,1 and BCube23,1 to complete the test, the sum of fault-tolerant values of three BCube3,1 are Fk = 6, which is two more than fault-tolerant values of BCube3,2, thereby increasing the degree of diagnosability.

    BCube4,4 can be divided into MBCube4,b (0<b<k) and different values of b corresponding M and fault-tolerant values Fk are shown in Table 3.

    Table 3.  Fault-tolerant values of BCube4,4.
    M Fk MFk
    BCube4,3 4 10 40
    BCube4,2 16 7 112
    BCube4,1 64 4 256

     | Show Table
    DownLoad: CSV

    Table 3 shows that the smaller b, the greater the fault tolerance value Fk could be obtained. However, the central server needs to send and collect information to all BCube4,b at the same time during the parallel testing, and a higher performance central server is needed to increase costs with larger M. Therefore, for a more appropriate division of Xn,k into M Xn,b (0<b<k), there is the following equation (3.4):

    H=α|F|βT(tn,p)γC(M). (3.4)

    In equation (3.4), |F| represents the sum of fault-tolerant values Fk of MXn,bs and T(tn,p) represents the time spent on tn,p server tests. C(M) represents performance requirements for central servers. α,β,γ are the weights, and their values of different network structures are also different. The larger the H value, the more reasonable the division.

    For example, when α=0.1,β=0.1,γ=0.5, BCube4,4 can be divided into BCube4,3, BCube4,2 or BCube4,1. The equation (6) can get the following values by taking into the above value.

    H(BCube4,3)=0.14

    H(BCube4,2)=0.77

    H(BCube4,1)=0.76

    Since H(BCube4,2) is the largest, it is most reasonable to divide BCube4,4 into 16 BCube4,2.

    Case 2: b=0

    Xn,k is divided into MXn,0, where n is sufficiently large. By definition 2 and 3, Xn,0 is a complete graph G(V,E) and x,xV,NG(x)=Vx. That is, x is adjacent to all other nodes. If x is fault-free, using x to test the remaining nodes in Xn,0 can accurately measure the state of other nodes. This case does not need to generate a Hamiltonian cycle for diagnosis, which can greatly improve the test efficiency.

    In this section, the FHFD algorithm and hierarchical testing method are applied to BCube and DCell networks, respectively, and their diagnosabilities are analyzed and compared with traditional diagnostic strategies.

    The diagnosability of the FHFD algorithm (in only Case 1) combined with hierarchical test method for DCell is as follows:

    Theorem 4: The maximum diagnosability of DCelln,k is (n2)(tn,k/tn,1) by combining the FHFD algorithm and hierarchical test method with n4 and k>0.

    Proof: DCelln,k can be divided into (tn,k/tn,1)DCelln,1 by equation (3) and DCelln,1 is (n-2)-fault tolerant Hamiltonian by Theorem 1, then the sum of fault tolerant value of (tn,k/tn,1)DCelln,1 is (n2)(tn,k/tn,1).

    We summarize the diagnosability of DCelln,k based on different strategies in the PMC model in Table 4, which shows that the FHFD algorithm combined with hierarchical testing can greatly improve the diagnosability.

    Table 4.  Diagnosability of DCelln,k.
    DCell3,2 DCell4,2 DCell3,3 DCell4,3
    tnk 156 420 24492 176820
    precise 4 5 5 6
    pessimistic 5 6 7 8
    t/c 6 7 9 12
    FHFD+Hierarchical 13 42 2041 17682

     | Show Table
    DownLoad: CSV

    This section will study the diagnosability of the FHFD algorithm (in only Case 1) combined with hierarchical test method for BCube.

    Theorem 5: The maximum diagnosability of BCuben,k is 2(n2)nk1 by combining the FHFD algorithm and hierarchical test method while n4 and k>0.

    Proof: By equation (3), BCuben,k=nk1BCuben,1 and BCuben,1 is 2(n-2)-fault tolerant Hamiltonian by Theorem 2 and, thus, the sum of diagnosability of nk1BCuben,1 is 2(n2)nk1.

    We summarize the diagnosability of BCuben,k based on different strategies in the PMC model in Table 5, which shows that the FHFD algorithm combined with hierarchical testing can greatly improve the degree of diagnosability.

    Table 5.  Diagnosability of BCuben,k.
    BCube3,2 BCube4,2 BCube3,3 BCube4,3
    tnk 64 256 1024 4096
    precise 8 11 14 17
    FHFD+Hierarchical 16 64 256 1024

     | Show Table
    DownLoad: CSV

    This section simulates the test time of FHFD and the hierarchical method in BCube network by MATLAB, as shown in Figure 8.

    Figure 8.  The testing time of FHFD algorithm and hierarchical method.

    (1) BCube3,4 has 243 server nodes, and diagnosis only spends 0.97s. BCube4,4 has 1,024 nodes and diagnosis only spends 5.12s, which shows that the time consumed increases linearly as the number of server nodes increases. This result proves that server nodes have a significant impact on diagnostic time.

    (2) BCube4,4 has 1,024 nodes and spends 5.12s in the actual test. BCube6,3 has 1296 nodes and spends 21.38s in the actual test, which shows that the size of the two networks is similar but the test time is quite different. The reason is that the two are divided into layers through the Hierarchical Diagnosis Based on Recursive (HDBR) algorithm. BCube6,1contains 36 nodes, BCube4,1 contains 16 nodes and the time is different to construct Hamiltonian cycles for 36 nodes and 16 nodes, resulting in a large difference in the final test time but it is still acceptable.

    In this paper, we proposed a novel node fault diagnosis strategy based on the PMC model in DCNs structure, satisfying recursiveness by using fault-tolerant Hamiltonian cycle property. Compared with the traditional diagnosis strategy, our proposed strategy can meet the characteristics of high diagnosability, high accuracy and high efficiency. Therefore, this strategy is more suitable for system-level fault diagnosis of DCNs.

    The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.

    This work was supported by the National Natural Science Foundation of China (61672209, 61701170, 62173127, 61973104), Grain Information Processing Center in Henan University of Technology (KFJJ-2020-111), China Postdoctoral Science Foundation funded project (2014M560439), Jiangsu Planned Projects for Postdoctoral Research Funds (1302084B), Scientific & Technological Support Project of Jiangsu Province (BE2016185) and the Science and technology development plan of Henan Province (192102210282, 202102210327).

    The authors declare there is no conflict of interest.



    [1] Z. P. Zhao, W. D. Yang, B. Wu, Flow aggregation through dynamic routing overlaps in software defined networks, Elsevier Comput. Networks, 176 (2020), 107293. https://doi.org/10.1016/j.comnet.2020.107293 doi: 10.1016/j.comnet.2020.107293
    [2] N. W. Chang, S. Y. Hsieh, Conditional diagnosability of augmented cubes under the PMC model, IEEE Transact. Dependable Secure Comput., 9 (2011), 46–60. https://doi.org/10.1109/TDSC.2010.59 doi: 10.1109/TDSC.2010.59
    [3] F. P. Preparata, G. Metze, F. P. Chien, On the connection assignment problem of diagnosable systems, IEEE Transact. Electron. Comput., 16 (2006), 848–854.
    [4] S. Karunanithi, A. D. Friedman, Analysis of digital systems using a new measure of system diagnosis, IEEE Transact. Comput., 28 (2006), 121-133. https://doi.org/10.1109/TC.1979.1675301 doi: 10.1109/TC.1979.1675301
    [5] A. K. Somani, O. Peleg, On diagnosability of large fault sets in regular topology-based computer systems, IEEE Transact. Computers, 45 (1996), 892–903. https://doi.org/10.1109/12.536232 doi: 10.1109/12.536232
    [6] H. Zhuang, S. Zheng, X. Liu, C. K. Lin, X. Li, Reliability evaluation of generalized exchanged hypercubes based on imprecise diagnosis strategies, Parallel Process. Letters, 31 (2021), 2150005.
    [7] C. H. Tsai, A quick pessimistic diagnosis algorithm for hypercube-Like multiprocessor systems under the PMC model, IEEE Transact. Computers, 62 (2013), 259–267. https://doi.org/10.1109/TC.2011.228 doi: 10.1109/TC.2011.228
    [8] S. Zhou, L. Lin, L. Xu, D. Wang, The t/k -diagnosability of star graph networks, IEEE Transact. Comput., 64 (2015), 547–555. https://doi.org/10.1109/TC.2013.228 doi: 10.1109/TC.2013.228
    [9] D. Cheng, The pessimistic diagnosability of graphs and its applications to four kinds of interconnection networks, Int. J. Computer Math. Computer Syst. Theory, 4 (2019), 37–47. https://doi.org/10.1080/23799927.2019.1567593 doi: 10.1080/23799927.2019.1567593
    [10] L. Lin, Y. Huang, L. Xu, S. Y. Hsieh, A pessimistic fault diagnosability of large-scale connected networks via extra connectivity, IEEE Transact. Parallel Distributed Syst., 33 (2021), 415–428. https://doi.org/10.1109/TPDS.2021.3093243 doi: 10.1109/TPDS.2021.3093243
    [11] X. Li, J. Fan, C. K. Lin, X. Jia, Diagnosability evaluation of the data center network DCell, Computer J., 61 (2018), 129–143. https://doi.org/10.1093/comjnl/bxx057 doi: 10.1093/comjnl/bxx057
    [12] X. Li, J. Fan, C. K. Lin, B. Cheng, X. Jia, The extra connectivity, extra conditional diagnosability and t/k-diagnosability of the data center network DCell, Theor. Computer Sci., 766 (2019), 16–29.
    [13] H. Huang, Y. Chen, X. Liu, Z. Han, F. Xiao, t-diagnosability and conditional diagnosability of BCube networks, in IEEE 21st International Conference on High Performance Computing and Communications, (2019). https://doi.org/10.1109/HPCC/SmartCity/DSS.2019.00328
    [14] N. X. Heng, C. Z. Run, Z. Miao, A hierarchical fault diagnosis algorithm for data center networks, J. Electron., 42 (2014), 2536–2542. https://doi.org/10.3969/j.issn.0372-2112.2014.12.029 doi: 10.3969/j.issn.0372-2112.2014.12.029
    [15] T. K. Li, C. H. Tsai, A fast fault-identification algorithm for bijective connection graphs using the PMC model, Inform. Sci., 187 (2012), 291–297. https://doi.org/10.1016/j.ins.2011.10.022 doi: 10.1016/j.ins.2011.10.022
    [16] L. C. Ye, J. R. Ye, Five-round adaptive diagnosis in hamiltonian networks, IEEE Transact. Parallel Distributed Syst., 26 (2015), 2459–2464. https://doi.org/10.1109/TPDS.2014.2350480 doi: 10.1109/TPDS.2014.2350480
    [17] X. L. Sun, J X Fan, B L Cheng, Y. Wang, L. Zhang, Probabilistic fault diagnosis of clustered faults for multiprocessor systems, J. Comput. Sci. Technol., 38 (2023), 821–833.
    [18] C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, S. Lu, Dcell: A scalable and Fault-tolerant network structure for data centers, SIGCOMM, (2008), 75–86.
    [19] X. Wang, A. Erickson, J. Fan, X. Jia, Hamiltonian properties of DCell networks, Comput. J., 11 (2015), 1–11. https://doi.org/10.1093/comjnl/bxv019 doi: 10.1093/comjnl/bxv019
    [20] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, et al., BCube: A high performance, server-centric network architecture for modular data centers, SIGCOMM, (2009), 63–74. https://doi.org/10.1145/1594977.1592577
    [21] C. H. Huang, J. F. Fang, The pancyclicity and the hamiltonian-connectivity of the generalized base-b hypercube, J. Supercomput., 34 (2008), 263–269.
    [22] G. J. Wang, C. K. Lin, J. X. Lin, J. Y. Lin, B. L. Lin, Fault-tolerant hamiltonicity and hamiltonian connectivity of BCube with various faulty elements, J. Comput. Sci. Technol., 35 (2020), 1064–1083.
  • This article has been cited by:

    1. Wanling Lin, Xiao-Yan Li, Jou-Ming Chang, Xiangke Wang, An Improved Fault Diagnosis Algorithm for Highly Scalable Data Center Networks, 2024, 12, 2227-7390, 597, 10.3390/math12040597
    2. Ameerah Abdulwahhab Flaifel, Abbas Fadel Mohammed, Fatima kadhem Abd, Mahmood H. Enad, Ahmad H. Sabry, Early detection of arc faults in DC microgrids using wavelet-based feature extraction and deep learning, 2024, 18, 1863-2386, 195, 10.1007/s11761-024-00420-z
  • Reader Comments
  • © 2024 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
通讯作者: 陈斌, bchen63@163.com
  • 1. 

    沈阳化工大学材料科学与工程学院 沈阳 110142

  1. 本站搜索
  2. 百度学术搜索
  3. 万方数据库搜索
  4. CNKI搜索

Metrics

Article views(1344) PDF downloads(36) Cited by(2)

Figures and Tables

Figures(8)  /  Tables(5)

/

DownLoad:  Full-Size Img  PowerPoint
Return
Return

Catalog