Multilabel learning is an important topic in machine learning research. Evaluating models in multilabel settings requires specific cross validation methods designed for multilabel data. In this article, we show that the most widely used cross validation split quality measure does not behave adequately with multilabel data that has strong class imbalance. We present improved measures and an algorithm, optisplit, for optimizing cross validations splits. Extensive comparison of various types of cross validation methods shows that optisplit produces more even cross validation splits than the existing methods and it is among the fastest methods with good splitting performance.
Citation: Henri Tiittanen, Liisa Holm, Petri Törönen. Novel split quality measures for stratified multilabel cross validation with application to large and sparse gene ontology datasets[J]. Applied Computing and Intelligence, 2022, 2(1): 49-62. doi: 10.3934/aci.2022003
Related Papers:
[1]
Gani Stamov, Ekaterina Gospodinova, Ivanka Stamova .
Practical exponential stability with respect to h−manifolds of discontinuous delayed Cohen–Grossberg neural networks with variable impulsive perturbations. Mathematical Modelling and Control, 2021, 1(1): 26-34.
doi: 10.3934/mmc.2021003
[2]
Bangxin Jiang, Yijun Lou, Jianquan Lu .
Input-to-state stability of delayed systems with bounded-delay impulses. Mathematical Modelling and Control, 2022, 2(2): 44-54.
doi: 10.3934/mmc.2022006
[3]
Shipeng Li .
Impulsive control for stationary oscillation of nonlinear delay systems and applications. Mathematical Modelling and Control, 2023, 3(4): 267-277.
doi: 10.3934/mmc.2023023
[4]
Yanchao He, Yuzhen Bai .
Finite-time stability and applications of positive switched linear delayed impulsive systems. Mathematical Modelling and Control, 2024, 4(2): 178-194.
doi: 10.3934/mmc.2024016
[5]
Xingwen Liu, Shouming Zhong .
Stability analysis of delayed switched cascade nonlinear systems with uniform switching signals. Mathematical Modelling and Control, 2021, 1(2): 90-101.
doi: 10.3934/mmc.2021007
[6]
Hongyu Ma, Dadong Tian, Mei Li, Chao Zhang .
Reachable set estimation for 2-D switched nonlinear positive systems with impulsive effects and bounded disturbances described by the Roesser model. Mathematical Modelling and Control, 2024, 4(2): 152-162.
doi: 10.3934/mmc.2024014
[7]
M. Haripriya, A. Manivannan, S. Dhanasekar, S. Lakshmanan .
Finite-time synchronization of delayed complex dynamical networks via sampled-data controller. Mathematical Modelling and Control, 2025, 5(1): 73-84.
doi: 10.3934/mmc.2025006
[8]
Tengda Wei, Xiang Xie, Xiaodi Li .
Persistence and periodicity of survival red blood cells model with time-varying delays and impulses. Mathematical Modelling and Control, 2021, 1(1): 12-25.
doi: 10.3934/mmc.2021002
[9]
Qin Xu, Xiao Wang, Yicheng Liu .
Emergent behavior of Cucker–Smale model with time-varying topological structures and reaction-type delays. Mathematical Modelling and Control, 2022, 2(4): 200-218.
doi: 10.3934/mmc.2022020
[10]
Guoyi Li, Jun Wang, Kaibo Shi, Yiqian Tang .
Some novel results for DNNs via relaxed Lyapunov functionals. Mathematical Modelling and Control, 2024, 4(1): 110-118.
doi: 10.3934/mmc.2024010
Abstract
Multilabel learning is an important topic in machine learning research. Evaluating models in multilabel settings requires specific cross validation methods designed for multilabel data. In this article, we show that the most widely used cross validation split quality measure does not behave adequately with multilabel data that has strong class imbalance. We present improved measures and an algorithm, optisplit, for optimizing cross validations splits. Extensive comparison of various types of cross validation methods shows that optisplit produces more even cross validation splits than the existing methods and it is among the fastest methods with good splitting performance.
1.
Introduction
Internet of Things (IoT) devices are often limited by storage, computing and energy consumption, and cannot run high-strength encryption algorithms and protocols. Lightweight algorithms and protocols have emerged to ensure device security while saving resource consumption. However, these devices always generate, process, transmit and store private information. The security of these devices is receiving increasing attention and the cryptographic technology is the key to ensuring these security requirements. Therefore, the widespread use of resource constrained devices has attracted attention to lightweight cryptographic ciphers, such as Ascon [1], CLEFIA [2], HIGHT [3], KATAN [4], LED [5], Midori [6], Piccolo [7], PRESENT [8], PRINCE [9] and so on.
Lightweight block cipher Midori is designed by Banik et al. [6] and presented at the ASIACRYPT 2015 conference. It can be divided into two variants, Midori-64 and Midori-128, depending on the block size. In order to reduce energy consumption during algorithm execution, the optimization part of Midori includes replacing 8-bit S-Boxes with 4-bit S-Boxes and replacing maximum separable distance (MDS) matrices with almost MDS binary matrices. With these optimizations, Midori seems to achieve lower latency while maintaining a smaller area. However, the security (including both theoretical and practical security) of lightweight ciphers is a key factor in protecting security sensitive data within the chip from attackers. The theoretical analysis results of Midori include differential analysis [6], linear analysis [6], truncated differential and related-key differential attacks [10], meet-in-middle attack [11] and impossible differential analysi [12,13,14]. In fact, the practical attacks are more important for lightweight ciphers, but for Midori, such cryptanalysis [15,16,17,18] are not adequate.
Differential fault analysis (DFA) is a typical cryptanalysis technique used to attack cryptographic implementations and devices. It was introduced by Biham and Shamir [19] against DES like cryptographic systems. Up to now, a large number of cryptographic algorithms are threatened by the differential fault analysis, such as AES [20,21], DES [22] and SM4 [23] (also called SMS4). The countermeasures against fault attacks can be divided into two categories. The first type detects fault injection by adding hardware sensors. The main drawback of this attack is that the fault detection technique targets exact fault injection methods and cannot prevent all fault injection methods. The second category requires increasing hardware surfaces and number of operations to protect the hardware implementation from fault attacks. A more effective approach is to have a trade-off between efficiency and protection. This type of countermeasure typically protects the hardware implementation by resisting the state-of-the-art fault attacks. The same countermeasures have been applied to AES [24], DES [25], etc.
In [15], Cheng et al. introduced a 3-round fault propagation property of single fault. Based on the vulnerability caused by the almost MDS matrix, they analyzed distinct patterns of nonzero differentials in the ciphertexts and found that the fault injection position could be inferred only using correct and faulty ciphertext. By retrieving the related subkeys, the secret key search space is reduced from 2128 to 280 for Midori-64 and from 2128 to 232 for Midori-128, respectively. In [16], Wang et al. extended the differential propagation trail to 4 rounds and evaluated the remained key entropy of Midori. Based on the 4-round fault propagation path, the key entropy of Midori-64 and Midori-128 decreased to 99.72 and 71.98 bits, respectively. The key recovery process requires a 3-round manual differential analysis and/or algebraic fault analysis, and the entropy of Midori-128 was ultimately reduced to 68.47 bits. In [17], Nozaki et al. injected faults in the last 2 rounds and proposed a 2 stages statistical fault analysis to analyze the 128-bit key of Midori. For Midori-64, the whitening key WK is recovered by injecting 16 different faults in the last round. Subsequently, they used the whitening key to decrypt 1 round and obtained the output of the 15-th round. Then some new faults are injected into the 15-th round, so that they can guess 65,536 keys for one column at a time. In [18], Li et al. proposed a ciphertext-only fault analysis with 6 different distinguishers to break Midori. They encrypt the same plaintext under the same key and inject faults in the penultimate round to obtain enough ciphertexts. Then, a differential analysis process is executed column by column to recover the secret key used in the last 2 rounds. Their work also indicates that the hamming weight distinguisher works best, requiring 280 and 132 ciphertexts for Midori-64 and Midori-128, respectively. In Table 1, we compare our works with previous fault attacks on Midori.
Table 1.
Comparison of this work with previous fault attacks on Midori.
Using the meet-in-the-middle fault attack [26], 7 conditions can be attached to 4-round cell-oriented fault propagation trail, which allows us to reduce key space of the last round. Comparison with the work of [16], the key space is reduced from 299.72 to 254.01 for Midori-64 and from 271.98 to 232.12 for Midori-128. In addition, using the 4-round fault trail, the complexity of the secret key recovery for Midori-64 is reduced from the previous 280 to 228, 232 and 256 for the different cases. Our results indicate that the first and last four rounds of Midori must be protected to maintain the security.
The structure of the remaining part of the paper is as follows. Section 2 gives a brief description of Midori and provides some notations. Section 3 studies the 4-round fault propagation trail. Section 4 introduces the key recovery approach and experimental results, and Section 5 summarizes the article.
2.
Description of Midori and notations
Midori is an SPN structure block cipher with a key size of 128 bits. Its state is represented by a 4 by 4 matrix and is defined in Eq (2.1), where si∈{0,1}m and the length m of each cell is 4 (for Midori-64) or 8 (for Midori-128).
S=[s0s4s8s12s1s5s9s13s2s6s10s14s3s7s11s15]
(2.1)
The number of rounds to be performed during the execution of the algorithm is dependent on the block size n. The number of rounds is represented by R, where R = 16 and R = 20 when n = 64 and n = 128, respectively.
2.1. SBoxes and matrices
Midori contains 2 bijective 4-bit S-Boxes, Sb0 and Sb1. The 8-bit S-Boxes of Midori-128 are generated by Sb1, as detailed in [6]. For Midori-64, Sb0 is applied to each cell of state S: si = Sb0(si),0≤i≤15. Similarly, for Midori-128, 4 8-bit S-Boxes SSb0, SSb1, SSb2 and SSb3 are utilized, please see the design document [6] for details. The matrix M of Midori is defined in Eq (2.2) and the 4 m-bit values (x0,x1,x2,x3) are updated by Eq (2.3).
M=(0110101011001111)
(2.2)
(x0,x1,x2,x3)T=M⋅(x0,x1,x2,x3)T
(2.3)
2.2. Round function
Midori's round function consists of four steps, namely S-layer SubCell, P-layer ShuffleCell and MixColumn, and key-addition layer KeyAdd. The n-bit state is updated at each layer as follows.
SubCell(S): For Midori-64, Sb0 is applied to every 4-bit cell, i.e., si = Sb0(si), where 0≤i≤15. For Midori-128, si = SSbimod4(si), where 0≤i≤15.
ShuffleCell(S):Each cell of the state is permuted as follows: (s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14) = (s0, s10, s5, s15, s14, s4, s11, s1, s9, s3, s12, s6, s7, s13, s2, s8).
MixColumn(S):The matrix M is used for updating each column of S, i.e., (si,si+1,si+2,si+3)T = M⋅(si,si+1,si+2,si+3)T, where i=0,4,8,12.
KeyAdd(S,RKi):The i-th n-bit round key RKi is XORed to the state S.
2.3. Ciphers
For encryption, Midori uses a round function that is composed of four different cell-oriented transformations including SubCell, ShuffleCell, MixColumn and KeyAdd. The plaintext or ciphertext is loaded into the state. After an initial whitening key addition, the state is transformed by implementing a round function 16 or 20 times, with the final round differing slightly from the first R−1 rounds. The round function is parameterized using a key schedule and the overall structure of encryption is depicted in Figure 1. For decryption, the inverse round function consists of SubCell, MixColumn, InvShuffleCell and KeyAdd. The permutation of InvShuffleCell is defined as follows: (s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14) = (s0, s7, s14, s9, s5, s2, s11, s12, s15, s8, s1, s6, s10, s13, s3, s4). The round key used by KeyAdd of the inverse round function is performs the following operations: L−1(RKi) = InvShuffleCell ⋅ MixColumn(RKi). Then, L−1(RKi) is XORed to the state S.
Figure 1.
The overall encryption structure of Midori.
For Midori-64, the 128-bit master key is divided into 2 64-bit values K0 and K1. The whitening key WK=K0⊕K1 and the round key RKi=Kimod2⊕αi, where 0≤i≤14. For Midori128, WK=K and RKi=K⊕βi, where 0≤i≤19. The specific values of constants αi and βi can be found in [6]. In our attack process, Midori-128 relies on the last round to recover the secret key K, while Midori-64 needs to utilize the last two rounds to obtain the 128-bit K.
2.5. Notations
DFA applies key recovery attacks using the relationship between the secret key and behavior information under the faults in the intermediate state. It derives information about the secret key through the difference between correct and faulty ciphertext (using the same plaintext). For the sake of description, some symbols are introduced in this part.
1)Xi represents the input of the i-th round, where 0≤i≤R, X0 is the plaintext and XR is the ciphertext.
2)Xi[j] denotes the j-th cell of Xi, where 0≤j≤15.
3)RKi is the round key for the the i-th round, where 1≤i<R and RK0=RKR=WK.
4)ΔX is the XOR difference of 2 states X and X′, i.e., ΔX = X⊕X′.
5)δx is the XOR difference of 2 cells x and x′, i.e., δx = x⊕x′.
3.
Fault Propagation and properties of Midori
3.1. Fault Propagation in the last 4 rounds
For the sake of simplicity, we assume that the fault occurred in the first cell. As shown in Figure 2, the falut F introduced in the (R-3)-th round is converted to F1 after SubCell and ShuffleCell. Due to the influence of MixColumn layer, the difference of the other three cells belonging to the same column becomes F1, and the KeyAdd step has no effect on the difference of S. The output difference of the (R-2)-th round is ΔSR−2 = {δs4=δs5=δs6=F2,δs8=δs10=δs11=F3,δs12=δs13=δs15=F3}. The output differential values after the (R-1)-th round is ΔSR−1 = {δs0=F5,δs1=F6,δs2=F7,δs3=F8,δs4=δs7=F9,δs5=F4c,δs6=F2a, δs8=δs9=F10,δs10=F2c,δs11=F3a,δs12=δs14=F11,δs13=F4a,δs15=F3b}. Here, F6⊕F7⊕F8 = 0, F9 = F4c⊕F2a, F10 = F2c⊕F3a, F11 = F4a⊕F3b.
For any ciphertext C and C′, by defining OD(WK[i])=Subcell(C[i]⊕WK[i])⊕Subcell(C′[i]⊕WK[i]), the equation system (3.1) will be obtained. If the attacker knows which byte injected a fault, combining equation system (3.1), the 15m bits of WK will be recovered. There are two forms of equations in system (3.1), the first satisfying OD(WK[i])=OD(WK[j]) and the second conforming to OD(WK[i])⊕OD(WK[j])=OD(WK[k]). The method in [26] will be applied to recover the corresponding key with a complexity of 2m and 22m, respectively.
After WK is recovered, we will obtain SR−1 and S′R−1 by decrypting C and C′. For state SR−1 and S′R−1, assuming SID(RKR−1[i],RKR−1[j],RKR−1[k])=Subcell(SR−1[i]⊕SR−1[j]⊕SR−1[k]⊕RKR−1[i]⊕RKR−1[j]⊕RKR−1[k])⊕Subcell(S′R−1[i]⊕S′R−1[j]⊕S′R−1[k]⊕RKR−1[i]⊕RKR−1[j]⊕RKR−1[k]), the equation system (3.2) will be acquired. For Midori-64, the 64-bit RK14 will be obtain with complexity of 23m and the details are provided in Section 4.1.
From here until the end of the paper, we assume that t represents the number of pairs used for the attack. There are 16 different single-cell fault injection positions and each form corresponds to a set of equations similar to the equation systems (3.1) and (3.2). Meanwhile, we should consider 3 different fault injection approaches. In the first case, we know the exact location of the fault injection, but it may not necessarily be in the same location. The specific equation systems that match the fault ciphertext pairs will be used for the key recovery process. In the second scenario, we inject faults using the same method at the same time, i.e., fault occurs in the same cell. In the third case, which is the worst case, we do not impose any constraints on the falut injection location. Here, for all correct and faulty ciphertext pairs, we need to calculate each of the 16 intermediate states and this requires an additional 24t(=16t) time complexity.
3.2. Characteristics of SBoxes
For x, y, z and the differences δx, δy, δz, the relationship of input differential and output differential in the SubCell is defined as follows:
When only 1 cell-oriented fault is injected and tested 224 times for Midori-64, the positions of the faults satisfy the uniform distribution and the maximum probability of Eqs (3.3) and (3.4) are 2−2=64/256. Equations (3.3) and (3.4) only involve 2 and 3 key calculations, and only 4 and 6 pairs of ciphertexts are needed to reduce the size of the candidate key set to 1. All equations in (3.2) require guessing the keys of 6 cells, and the size of the candidate key set obtained using 6 ciphertext pairs is 212=24×6×2−2×6. We need an additional 6 pairs of ciphertext to filter 212 candidate keys. Fortunately, the equations in (3.2) can provide an additional 3 filtering conditions, each of which can filter 212 keys. For example, the filtering condition corresponding to the first two equations in (3.2) is SID(RK14[0],RK14[1],RK14[3]) = SID(RK14[8],RK14[9],RK14[10]), which can reduce the size of the candidate key set for the first two equations to 1. Thus, we only need 6 pairs of ciphertext to reduce the number of candidate key sets to an acceptable level. In the following sections, we will provide the corresponding experimental results to recover the 128-bit key of Midori.
Algorithm 1 2-cell key recovery procedure
Input: a pair of correct and/or faulty ciphertexts (C, C′), cell number i and j.
Output: candidate key list Li,j.
1: fork=0 to 2m−1do {2m time and memory}
2: Set WK[i]=k.
3: Calculate OD(WK[i]) and store (OD(WK[i]),WK[i]) in Hash Table L1.
4: end for 5: fork=0 to 2m−1do {2m time and 1 memory}
6: Set WK[j]=k.
7: Calculate OD(WK[j]) and look in L1 corresponding to OD(WK[i]).
8: ifOD(WK[i]) = OD(WK[j])then 9: Add (WK[i],WK[j]) to the candidate key list Li,j.
10: end if 11: end for
Algorithm 2 3-cell key recovery procedure
Input: a pair of correct and/or faulty ciphertexts (C, C′), cell number i, j and k.
Output: candidate key list Li,j,k.
1: forx=0 to 2m−1do {2m time and memory}
2: Set WK[i]=x.
3: Calculate OD(WK[i]) and store (OD(WK[i]),WK[i]) in Hash Table L1.
4: end for 5: for all (WK[j],WK[k])∈F2m×F2mdo {22m time and 1 memory}
6: Calculate Temp = OD(WK[j])⊕OD(WK[k]) and look in L1 corresponding to OD(WK[i]).
7: ifTemp = OD(WK[i])then 8: Add (WK[i],WK[j],WK[k]) to the candidate key list Li,j,k.
9: end if 10: end for
4.
Key recovery and experiment
4.1. Key recovery of Midori
In our attack, we implement a 1-byte fault injection between the MixColumn of the (R-4)-th round and the MixColumn of the (R-3)-th round on Midori. Assuming our fault attack approach requires t correct and/or faulty ciphertext pairs, denoted as (Ci,C′i)(1≤i≤6), we can use the same method as in Section 3.1 to derive 16 different fault propagation patterns (the work of Cheng et al. also induced fault propagation patterns). Based on the fault injection way, we will propose 3 different key recovery attacks, i.e., 2-cell key recovery procedure, 3-cell key recovery procedure and 6-cell key recovery procedure, please see the Algorithms 1–3 for details.
If a fault is fixed in the first cell which corresponds to the first case of Section 3.1, the following technique allows us to recover 15m bits of WK in time 6×22m and memory 6×2m for Midori, see also Algorithm 4. In this algorithm, the time complexity and memory usage of steps 2 to 7 are 2m. The time complexity of steps 8 to 15 is 22m, and the memory consumption is 2m. Thus, the key recovery process need 6×22m time and 6×2m memory. In fact, the memory usage of Algorithm 4 can be reduced to 2m. By guessing all the values of WK[0], the 128 bits WK will be recovered for Midori-128. However, for Midori-64, we can use equation system (3.2) to recover 64-bit RK14. With the help of the key generation algorithm, the 128 bits key of Midori-64 can be recovered with 6×24m time and 6×23m memory.
When the faults occur in an unknown specific cell, which corresponds to the second scenario, we need to traverse all the 16 patterns one by one, and the time complexity of the attack requires an additional 16, i.e., 6×25m for Midori-64 and 6×23m for Midori-128. For random single-cell faults, the time complexity of our attack is 6×25m×24t for Midori-64 and 6×23m×24t for Midori-128.
Algorithm 3 6-cell key recovery procedure
Input: a pair of correct and/or faulty ciphertexts (C, C′), cell number i, j, k, l, m and n.
Output: candidate key list Li,j,k,l,m,n.
1: for all (WK[i],WK[j],WK[k])∈F2m×F2m×F2mdo {23m time and memory}
2: Calculate Δ = SID(RK14[i], RK14[j], RK14[k]) and store (Δ,RK14[i], RK14[j], RK14[k]) in Hash Table L1.
3: end for 4: for all (WK[l],WK[m],WK[n])∈F2m×F2m×F2mdo {23m time and 1 memory}
5: Calculate Temp = OD(WK[l])⊕OD(WK[m]⊕OD(WK[n]) and look in L1 corresponding to Δ.
6: ifTemp = Δthen 7: Add (WK[i],WK[j],WK[k],WK[l],WK[m],WK[n]) to the candidate key list Li,j,k,l,m,n.
8: end if 9: end for
4.2. Experimental result
Our attacks implemented a PC using Visual Studio 2019 with Intel(R) Core(TM) i5-10210U CPU and 16 GB memory. We use software simulation for the fault injection. Six simulated faults are induced into the first cell of the (R-3)-th round. We tested the first and fourth equations of (3.1) using 6 ciphertext pairs, and the experimental results are shown in Table 2. Furthermore, we tested the first two equations of (3.2), and the results are shown in Table 3. Each test is measured 10,000 times, and the average is taken as the final result.
From the above tables, it can be seen that the differential propagation characteristics of Midori are much worse than expected. As shown in Table 2, the fault propagation deviates from the expected value by 1/4, which will increase the time complexity by 13.3=(0.75−3)3. The complexity of 3.4 has increased by 13.3=0.75−9. The fault propagation in 3 deviates from the expected value by 0.075 and the time complexity increase by 3.49=(0.925−4)4. For Midori-64, the key space of the last round is reduced to 254.01=24×45.33×332.44 using the single fault injection. The time complexity of the attack is reduced from 280 to 228 = 6×216×13.3×13.3×3.49, 232 and 256. Tor Midori-128, the time complexity is slightly increased from 232 to 232.12=6×224×13.3×3.49.
5.
Conclusions
In this paper, we present a 4-round cell-oriented fault propagation trail and categorize the single fault injection approachs into 3 types, i.e., exactly known the fault injuction location, faults occurred in an unknown specific cell and injucted random single-cell faults. When the faults occur in the specific cells, we reduced the key space of the last round by 245.71 and 239.86 for Midori-64 and Midori-128. For Midori-64, the complexity of key recovery is reduced from 280 to 228. In the second type of fault injection style, the time complexity is decreased to 232 for Midori-64. For random single-cell faults, the time complexity of our attack is 256 for Midori-64. Moreover, we have also demonstrated that 4 rounds of Midori must be protection to achieve its security.
Algorithm 4 Key Recovery of WK
Input: pairs of correct and/or faulty ciphertexts (Ci, C′i)((1≤i≤6)).
Output: candidate key list L.
1: for1≤i≤6do 2: Call Algorithm 1 with Parameters (Ci,C′i), 4, 7 and obtain temporary list Li.
3: Search the intersection of Li and store them as candidate keys of WK[4] and WK[7].
4: Call Algorithm 1 with Parameters (Ci,C′i), 8, 9 and obtain temporary list Li.
5: Search the intersection of Li and store them as candidate keys of WK[8] and WK[9].
6: Call Algorithm 1 with Parameters (Ci,C′i), 12, 14 and obtain temporary list Li.
7: Search the intersection of Li and store them as candidate keys of WK[12] and WK[14].
8: Call Algorithm 2 with Parameters (Ci,C′i), 1, 2, 3 and obtain temporary list Li.
9: Search the intersection of Li and store them as candidate keys of WK[1], WK[2] and WK[3].
10: Call Algorithm 2 with Parameters (Ci,C′i), 4, 5, 6 and obtain temporary list Li.
11: Search the intersection of Li and store them as candidate keys of WK[4], WK[5] and WK[6].
12: Call Algorithm 2 with Parameters (Ci,C′i), 9, 10, 11 and obtain temporary list Li.
13: Search the intersection of Li and store them as candidate keys of WK[9], WK[10] and WK[11].
14: Call Algorithm 2 with Parameters (Ci,C′i), 12, 13, 15 and obtain temporary list Li.
15: Search the intersection of Li and store them as candidate keys of WK[12], WK[13] and WK[15].
16: end for
Use of AI tools declaration
The authors declare that they have not used Artificial Intelligence (AI) tools in the creation of this article.
Acknowledgments
This work was supported by Science and Technology Projects of State Grid Corporation of China (Research on Lightweight Secure Connection Technologies for Electric Low Power Wireless Sensor Network with both Wide-band and Narrow-band Terminals, No. 5500-202158416A-0-0-00).
Conflict of interest
The authors declare that there are no conflicts of interest.
References
[1]
M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, et al., Gene ontology: tool for the unification of biology, Nature genetics, 25 (2000), 25–29. https://doi.org/10.1038/75556 doi: 10.1038/75556
[2]
S. Bengio, K. Dembczynski, T. Joachims, M. Kloft, M. Varma, Extreme Classification (Dagstuhl Seminar 18291), Dagstuhl Reports, 8 (2019), 62–80.
[3]
K. Bhatia, K. Dahiya, H. Jain, P. Kar, A. Mittal, Y. Prabhu, M. Varma, The extreme classification repository: Multi-label datasets and code, 2016.
[4]
F. Charte, A. Rivera, M. J. del Jesus, F. Herrera, A. Troncoso, H. Quintián, E. Corchado, On the impact of dataset complexity and sampling strategy in multilabel classifiers performance, Hybrid Artificial Intelligent Systems, (2016), 500–511. Springer International Publishing. https://doi.org/10.1007/978-3-319-32034-2_42
[5]
A. De Myttenaere, B. Golden, B. Le Grand, F. Rossi, Mean absolute percentage error for regression models, Neurocomputing, 192 (2016), 38–48. https://doi.org/10.1016/j.neucom.2015.12.114 doi: 10.1016/j.neucom.2015.12.114
[6]
F. Florez-Revuelta, Evosplit: An evolutionary approach to split a multi-label data set into disjoint subsets, Applied Sciences, 11 (2021), 2823. https://doi.org/10.3390/app11062823 doi: 10.3390/app11062823
[7]
M Merrillees, L Du, Stratified Sampling for Extreme Multi-Label Data, Pacific-Asia Conference on Knowledge Discovery and Data Mining, (2021), 334–345. https://doi.org/10.1007/978-3-030-75765-6_27
K. Sechidis, G. Tsoumakas, I. Vlahavas, On the stratification of multi-label data, Joint European Conference on Machine Learning and Knowledge Discovery in Databases, (2011), 145–158. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-23808-6_10
[10]
P. Szymański, T. Kajdanowicz, A scikit-based Python environment for performing multi-label classification, arXiv e-prints, 2017.
[11]
P. Szymański, T. Kajdanowicz, A network perspective on stratification of multi-label data, Proceedings of the First International Workshop on Learning with Imbalanced Domains: Theory and Applications, volume 74 of Proceedings of Machine Learning Research, (2017), 22–35.
P. Törönen, A. Medlar, L. Holm, Pannzer2: a rapid functional annotation web server, Nucleic acids res., 46 (2018), W84–W88. https://doi.org/10.1093/nar/gky350 doi: 10.1093/nar/gky350
[14]
G. Tsoumakas, E. Spyromitros-Xioufis, J. Vilcek, I. Vlahavas, Mulan: A java library for multi-label learning, J. Mach. Learn. Res., 12 (2011), 2411–2414.
N. Zhou, Y. Jiang, T. R. Bergquist, A. J. Lee, B. Z. Kacsoh, A. W. Crocker, K. A. Lewis, G. Georghiou, et al., The cafa challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens, Genome biol., 20 (2019), 1–23.
Henri Tiittanen, Liisa Holm, Petri Törönen. Novel split quality measures for stratified multilabel cross validation with application to large and sparse gene ontology datasets[J]. Applied Computing and Intelligence, 2022, 2(1): 49-62. doi: 10.3934/aci.2022003
Henri Tiittanen, Liisa Holm, Petri Törönen. Novel split quality measures for stratified multilabel cross validation with application to large and sparse gene ontology datasets[J]. Applied Computing and Intelligence, 2022, 2(1): 49-62. doi: 10.3934/aci.2022003
Figure 1. Examples of stratified multilabel cross validation with splits of different quality. 1) presents a target data with 24 data points and three classes. This could be a subset of a sparse, high dimensional dataset. 2) Next, the data is split into four cross validation folds in two ways, one representing a random split and the other representing a stratified split optimized over all classes. The ideal distribution of the positive data points in each fold would be 2/6 for the first class and 1/6 for the second and third class so that they would follow the class distributions of the whole data. Random split can result, like in here, in very unbalanced folds, while the stratified split follows closely the data distribution for all classes. 3) Here, the well-balanced stratified split is rearranged for increased clarity
Figure 2. A graphical comparison of the LD and rLD measures within one CV-fold. The measures are evaluated against the increasing class size (Di) on the X axis. The fold size is set to be 20% so that Sij=0.2∗Di would be the perfect split. The curves in A and B show the outcomes from LD and rLD for various ratios of positive data points in the fold over different class sizes. The curves in C and D show the resulting differences in LD or rLD when the class proportion changes between two curves in the upper plots. Notice that these resulting differences have a clear trend with LD in C and stable behavior with rLD in D. So rLD is clearly balanced across the class sizes and LD is not
Figure 3. Behavior of cross validation split quality measures on synthetic data with cross validation splits of varying quality. The cross validation folds have the same distribution for each class, so the scores should be equal. However, LD score depends on the class size if the folds are not perfectly distributed, while rLD and DCP correctly give the same weight for all classes. This is in agreement with the results shown in Figure 2