1.
Introduction
Disease is one of the greatest threats to human health. Many people die due to various diseases each year. With the development of medical science, several efforts have been given to various diseases [1,2,3], with an ultimate purpose of uncovering the underlying mechanism and the essential characteristics of different diseases to design efficient treatments. To date, several treatments have been designed to deal with different diseases, and drug is deemed as one of the most important parts. For fast indication of the utility of drugs, they are divided into several classes. Drugs in one class always have some common features. Correctly identifying the class of a candidate drug-like compound is helpful to uncover its effects. The classification procedure could be completed by traditional experiments. However, such methods are always inefficient and expensive. Designing quick and cheap methods that could accurately predict the class of a given drug is urgent. With the development of computer science, many advanced computational methods have been proposed in recent years and applied to deal with various practical problems, such as artificial neural networks [4,5,6], feature selection algorithms [7], statistical test methods [8,9,10]. These methods provide strong support for designing efficient methods.
Designing computational methods to classify drugs have recently become relatively popular. The most representative studies focused on predicting drug classes in the Anatomical Therapeutic Chemical (ATC) classification system. Such system is recommended and maintained by the World Health Organization (WHO). It has five levels, and several classes are contained in each level. Many computational methods have been proposed to predict classes in the first level of the ATC classification system. Most of them were based on machine learning algorithms, such as deep learning algorithms [11,12,13,14], network embedding algorithms [15], and multi-label classification algorithms [15,16,17]. Meanwhile, several types of drug properties were adopted to build methods that include drug fingerprints [16,17] and chemical–chemical similarity/interaction information [13,18,19,20,21]. These methods provide helpful insights into investigating other drug classification systems.
The Kyoto Encyclopedia of Genes and Genomes (KEGG) [22] recently reported a new drug classification system that classified drugs into the following 14 classes: (Ⅰ) anti-allergic agent; (Ⅱ) anti-inflammatory; (Ⅲ) antibacterial; (Ⅳ) antidiabetic agent; (Ⅴ) antifungal; (Ⅵ) antineoplastic; (Ⅶ) antiviral; (Ⅷ) cardiovascular agent; (Ⅸ) endocrine and hormonal agent; (Ⅹ) gastrointestinal agent; (Ⅺ) hypolipidemic agent; (Ⅻ) immunological agent; (${\rm{XIII}}$) neuropsychiatric agent; (${\rm{XIV}}$) ophthalmic agent. Under this system, drugs are roughly classified in accordance with the diseases they could treat. Identifying the class of any latent drug-like compound is helpful to preliminarily determine its effects to a certain type of disease. Targeted experiments for this type of disease could subsequently be conducted for further confirmation, which could accelerate the procedures for discovering novel drugs. Thus, for the first time, such a drug classification system was investigated in this study by developing a classification model, which could assign one of the above 14 classes to given drugs. A large drug network was constructed on the basis of chemical–chemical interaction (CCI) information reported in Search Tool for Interactions of Chemicals (STITCH) to extract informative drug features [23,24]. It was fed into a powerful network embedding algorithm called Node2vec [25] to produce drug features. These features were quite different from traditional drug fingerprint features, which could be obtained by considering each drug individually. The features adopted in this study abstracted drug associations, and they could overview each drug with all drugs as background, thus representing drugs at a system level. They were learnt by support vector machine (SVM) [26] or random forest (RF) [27] to construct the model. The cross-validation results indicated the good performance of the model. The model was also superior to those with classic drug fingerprint features or embedding features yielded by another network embedding algorithm. The proposed model adopted synthetic minority oversampling technique (SMOTE) [28] to tackle the imbalance problem, and the results suggested that it could balance the performance across different classes.
2.
Materials and methods
2.1. Dataset
Drugs and their information were retrieved from KEGG DRUG (https://www.genome.jp/kegg/drug/) [22]. According to https://www.kegg.jp/brite/br08332 (accessed on 10 January 2022), 820 drugs encoded by KEGG IDs were classified into 14 classes, as mentioned in Section 1 and listed in column 2 of Table 1. For convenience, these classes were tagged as C1–C14. The correspondence between tags and classes could be found in columns 1 and 2 of Table 1. Among these drugs, only two belong to two classes, while the others are in one exact class. Thus, these two drugs were removed. As the CCI information was employed to build the model, where drugs are represented by PubChem IDs, the KEGG IDs with unavailable PubChem IDs were also removed. If the PubChem ID of one KEGG ID was not included in the drug network (Section 2.2), it was also discarded. Finally, 579 drugs encoded by KEGG IDs were obtained. These drugs were classified into 14 classes, as listed in column 2 of Table 1. The number of drugs in each class is also listed in Table 1. The detailed drugs in each class are provided in S1.
2.2. Construction of drug network
Recently, network is a popular form to investigate various complex biological and medical problems [15,29,30,31], because it could overview all objects (nodes in the network) and evaluate them at a system level. In the present study, network was utilized to organize drugs.
The constructed network defined drugs reported in KEGG as nodes. The relationship between nodes should be determined to form the network. Several public databases provide the existing associations of drugs. Here, the CCI information reported in STITCH (http://stitch.embl.de/, version 4.0) was employed [23,24]. The file named "chemical_chemical.links.v4.0.tsv.gz" was downloaded from STITCH, which contains a large number of CCIs. Each CCI consists of two chemicals, represented by PubChem IDs, and one confidence score. The confidence score is obtained by measuring several aspects of chemicals, such as structures, reactions, activities, and literature descriptions. It ranges between 1 and 999 with a meaning that the higher the confidence score, the stronger the association between two chemicals. For formulation, the confidence score on the CCI between chemicals c1 and c2 was denoted as S(c1, c2). For the constructed network, two nodes were connected by an edge if their corresponding drugs could comprise a CCI with a confidence score larger than zero. Evidently, each edge in the network indicated a CCI. Each edge was assigned a weight, which was defined as S(c1, c2)/1000, to reflect different strengths of CCIs. Such network was denoted by Nd, and it contained 17,956 nodes and 3,134,797 edges.
2.3. Drug network embedding features
Encoding samples into continuous vectors that could be processed by most computer algorithms is one of the most important steps to build efficient classification models. As mentioned in Section 2.2, a drug network was constructed, which contained informative associations of drugs. This information could be used to encode drugs. In recent years, several network embedding algorithms, such as DeepWalk [32], Mashup [33], and LINE [34], have been proposed to assign continuous vectors to nodes in one or more networks. Here, Node2vec [25] was selected to extract drug features from Nd.
Node2vec is a powerful network embedding algorithm. In fact, it could be deemed as an improved version of DeepWalk. Similar to DeepWalk, it samples many paths from a given network. The procedures of generating paths are greatly improved compared with those in DeepWalk. For each node u, Node2vec produces a predefined number of paths starting at u. Suppose that ni-1 is the (i-1)-th node in one path starting at u. This path is extended to the i-th node, denoted by ni, by selecting one neighbor of ni-1. The selection probability from ni-1 to any other node is defined as
where $ {\pi _{vw}} $ stands for the transition probability from v to w, defined by $ {\pi _{vw}} = {\alpha _{pq}}(t, w) \cdot {w_{vw}} $; Z is the sum of transition probabilities from v to all its neighbors; and $ {w_{vw}} $ denotes the weight on edge connecting nodes v and w. Moreover, $ {\alpha _{pq}}(t, w) $ could be determined by
where t is the (i-2)-th node in the path, and dtw represents the distance between t and w. For the path starting at u, it is iteratively extended by selecting a neighbor of the current end-point in accordance with the probabilities calculated using Eq (1) until the length of the path reaches the predefined length. After all paths are produced, they are fed into the word2vec algorithm with SkipGram to produce vectors of nodes, where the paths are deemed as sentences and nodes are considered as words.
In this study, the program of Node2vec was retrieved from https://snap.stanford.edu/node2vec/. Such program was performed on Nd by using its default parameters. Afterwards, the feature vectors of 579 drugs were picked up to construct models.
2.4. Classification algorithms
Besides the representation of samples, another key step to build efficient classification models is to select a proper classification algorithm. In this study, two classic classification algorithms were used: SVM [26] and RF [27]. These two algorithms have been widely used to set up classification models with excellent performance for investigating various biological [35,36,37,38,39] and medical problems [40,41,42,43].
SVM is a statistical learning-based classification algorithm. Its original version could only process binary classification problems. Its idea is to determine an optimal hyperplane that could separate samples into two classes as perfect as possible. However, such hyperplane is not easy to obtain in the original feature subspace. It generally employs kernel trick to map samples into a high-dimension space, in which the hyperplane, designed to separate samples in two classes, could be easily constructed. A test sample is also mapped into the high-dimension space. The class of the test sample is determined in accordance with the side of the hyperplane it is located. Subsequently, SVM could tackle multi-class classification problems by using "one-versus-rest" or "one-versus-one".
RF is another powerful and widely used classification algorithm, which is quite different from SVM. In fact, it is a type of ensemble algorithm. Several decision trees are constructed and comprise RF. Some features are randomly selected from all features to learn a decision tree. Samples are randomly selected, with replacement, from the original dataset. Then, a decision tree is built on the basis of these randomly selected features and samples. RF gives the prediction by majority voting on predictions generated by the decision trees it contains.
In this study, the tools "SMO" and "RandomForest" in Weka were directly used [44], and they implemented the above mentioned SVM and RF, respectively. These tools were used with their default parameters.
2.5. Synthetic minority oversampling technique
A total of 14 drug classes were involved in this study. Some classes contained much less samples than the others, implying the dataset was imbalanced. The classification model directly built on such dataset may provide excellent performance on the majority class and poor performance on the minority class. SMOTE [28] was employed to tackle such problem. For each class, except the largest, one sample is randomly selected, denoted by x. The k nearest neighbors of x in the same class are found, and one is randomly selected, say y. On the basis of x and y, a new sample z is generated, which is defined as the linear combination of x and y with randomly produced combination coefficients. Given that the new sample is highly related to x and y, it is placed into the same class of x and y. The above procedures are executed several times until the size of the minority class is the same as that of the largest class.
This study adopted the tool "SMOTE" in Weka [44] to balance the investigated dataset, and default parameters were used.
2.6. Performance assessment
As a multi-class classification problem, the performance of the models could be measured by overall accuracy (ACC), which is defined as the proportion of all correctly predicted samples.
Recall, precision, F-measure, and Matthew's correlation coefficient (MCC) [45] were also computed for each class as follows:
where TP, FP, FN, and TN stand for true positive, false positive, false negative, and true negative, respectively. In multi-class classification, their specific definitions for one class are as follows: TP is the number of samples in the class that are also classified into this class, FP is the number of samples not in the class that are classified into this class, FN is the number of samples in the class that are classified into other classes, and TN is the number of samples not in the class that are classified into other classes. Eqs (3)–(6) could only assess the performance under a certain threshold. For full assessment, the receiver operating characteristic (ROC) and precision-recall (PR) curve analyses were further employed on the predicted results for each class. The area under the ROC curve, which was denoted by AUROC, and the PR curve, which was represented by AUPR, were calculated to fully evaluate the performance of different models on each class. For easy description, the six measurements for class Ci were denoted by precision(i), recall(i), F-measure(i), MCC(i), AUROC(i), and AUPR(i). Furthermore, the weighted precision, recall, F-measure, MCC, AUROC, and AUPR were calculated to fully evaluate the overall performance of all classification models.
3.
Results
In this study, a novel model was proposed to classify drugs in accordance with the drug classification system recently reported in KEGG. Its construction and evaluation procedures are illustrated in Figure 1. In this section, detailed evaluation results were provided, and further comparisons were conducted.
3.1. Performance of the proposed model
The proposed model adopted the drug features derived from a large drug network via Node2vec and SMOTE to tackle the imbalance problem. The best dimension of the features was found to be 128. Two classic classification algorithms (RF and SVM) were used as the prediction engine. For easy description, if RF (SVM) was selected as the prediction engine, the model was called RF (SVM) model. Each model was evaluated five times using a 10-fold cross-validation [46]. The confusion matrix of each model under each 10-fold cross-validation is provided in S2. The average performance is listed in Table 2.
For the RF model, the ACC was 0.9486, and the weighted precision, F-measure, MCC, AUROC, and AUPR were 0.9631, 0.9511, 0.9484, 0.9958, and 0.9830, respectively. The result of each measurement was relatively high, suggesting the excellent performance of the RF model. The model's performance on 14 classes is provided in Table 3, and the ROC and PR curves under one 10-fold cross-validation are shown in Figure 2(A), (B), respectively. Table 3 shows that most measurements were higher than 0.9. The number of classes on which the recall values were higher than 0.9 was 12, and this numbers for precision, F-measure, MCC, AUPRO, and AUPR were 11, 12, 12, 14, and 13, respectively. The results indicated the good performance of the RF model on most classes, conforming to its overall performance. Careful examination of the measurements in Table 3 showed that the RF model provided the lowest performance on class C12 (immunological agent) for each measurement. For example, the recall on this class was only 0.2726. This class only contained eight drugs, which was not only the smallest among all 14 classes, but was considered the lowest performing class. Although the RF model adopted SMOTE to tackle the imbalance problem, the class sizes could still have influence. In the following text, we would elaborate that such influence has been decreased by SMOTE.
As for the SVM model, the ACC reached 0.9583, which was slightly higher than that of the RF model. As for the other five measurements, they were 0.9752, 0.9640, 0.9625, 0.9830 and 0.9350 (Table 2), respectively. Compared with the measurements of RF model, SVM model yielded higher weighted precision, F-measure, MCC, whereas lower weighted AUROC and AUPR. These indicated that the RF and SVM model provided almost equal performance. The detailed performance of SVM model on fourteen classes is listed in Table 4. Furthermore, the ROC and PR curves of this model under one 10-fold cross-validation are illustrated in Figure 2(C), (D), respectively. Similar to the performance of RF model, most measurements listed in this table were higher than 0.9. The numbers of classes on which recall, precision, F-measure, MCC, AUROC and AUPR, respectively, were higher than 0.9 were 12, 13, 13, 13, 14 and 11, which were also similar to those of the RF model. Again, the performance of SVM model on the smallest class C12 (immunological agent) was also lowest for each measurement.
According to Table 2, the SVM model provided higher performance on four measurements than the RF model, whereas the RF model yielded higher performance on the other two measurements. The SVM model was slightly superior to the RF model. Several box plots were drawn on all measurements for the performance of the two models on 14 classes, as shown in Figure 3. For recall, the SVM model varied in a smaller range than the RF model. The same case occurred for F-measure, MCC, and AUROC. As for the remaining two measurements (precision and AUPR), the RF model changed in a smaller interval. On the whole, the stability of the performance of these two models on different classes was at the same level. However, the SVM model was slightly more stable. These results showed that the SVM model is a more suitable tool for classifying drugs than the RF model.
3.2. Comparison of the model without SMOTE
In this study, SMOTE was adopted to tackle the imbalance problem. The performance of the model without SMOTE must be investigated and compared with that of the model with SMOTE. In view of this, RF and SVM were directly applied on the features yielded by Node2vec to build the RF and SVM models without SMOTE. Their performance was also evaluated five times using a 10-fold cross-validation. The average performance is listed in Table 2. For the RF model without SMOTE, the ACC, weighted AUROC, and AUPR were 0.9410, 0.9930, and 0.9710, respectively. As no drugs were classified into C12, the precision, F-measure, and MCC for this class could not be computed, further inducing no results for weighted precision, F-measure, and MCC. According to the computed measurements, the performance of this model was only slightly lower than that of the RF model with SMOTE. As for the SVM model without SMOTE, the six measurements were 0.9600, 0.9610, 0.9600, 0.9550, 0.9840, and 0.9320, which were relatively similar to those of the SVM model with SMOTE. This finding suggested that the employment of SMOTE did not improve the overall performance of the model at all. However, the utilization of SMOTE was reflected not only on the improvement of the overall performance of the models but also on the balance of performance across different classes. Thus, the accuracy, i.e., recall, on the 14 classes yielded by the models with or without SMOTE was further investigated. A box plot was drawn for each model to clearly show the change in recall on all classes, as shown in Figure 4. When the same prediction engine was used (RF or SVM), the recall values of the model with SMOTE varied in a smaller interval than those of the model without SMOTE. This result implied that the employment of SMOTE balanced the performance across different classes. For example, on C12, the RF and SVM models without SMOTE produced recall values of 0.0000 and 0.5000, respectively, and these values were improved to 0.2726 and 0.7373, respectively, by employing SMOTE. For practical applications, the model with SMOTE could avoid the extremely good performance on the majority class and extremely low performance on the minority class.
3.3. Comparison of models using other drug features
In this study, the drug features derived from a network via Node2vec were used to build the model. Two other feature types were adopted to build models, which were compared with the proposed model.
The first feature type contained features derived from the drug network Nd via another powerful network embedding algorithm, Mashup [33]. The dimension was also set to 128. RF and SVM were applied on these features to construct models. SMOTE was employed to tackle the imbalance problem. The models were also assessed five times using a 10-fold cross-validation. For clear description, they were called Mashup-based RF and SVM models, whereas the proposed models were termed as Node2vec-based RF and SVM models. The average performance of the Mashup-based models is provided in Table 5. When RF was selected as the prediction engine, the Node2vec-based model provided evidently higher performance on all measurements than the Mashup-based model. In detail, the improvement on ACC, weighted F-measure, MCC, and AUPR was about 0.05 and such improvement on weighted precision and AUROC was about 0.03 and 0.01, respectively. Meanwhile, the superiority of the Node2vec-based SVM model to the Mashup-based SVM model was not very evident. However, the Node2vec-based SVM model provided higher performance on five measurements, implied that it was slightly better than the Mashup-based model.
The second feature type included the drug fingerprint features, which were widely used to deal with various drug- or chemical-related problems [16,38,47,48,49]. Here, RDKit [50] was adopted to extract the ECFP fingerprints of all 579 drugs. The fingerprints of each drug were represented by a binary vector with dimension of 1024. These fingerprint features were fed into RF and SVM to construct models. SMOTE was also employed. The models were called fingerprint-based models. All models were evaluated five times using a 10-fold cross-validation. The average performance is provided in Table 5. The results showed that most measurements were lower than 0.9, indicating low performance of such models. Compared with the performance of the proposed model (Table 2), the proposed model clearly provided much higher performance.
Finally, the ROC and PR curves on each class for some models were plotted. For the Node2vec-based model, the SVM model was selected because it was slightly better and more stable than the RF model. For the Mashup-based model, the SVM model was chosen as it was superior to the RF model (Table 5). As for the fingerprint-based model, the RF model was selected due to its superiority to the SVM model (Table 5). The ROC and PR curves of these models on all classes are shown in Figure 5. A notable detail that these curves were produced by one of the five 10-fold cross-validations. The areas under the ROC curves of the fingerprint-based RF model were much smaller than those of other two models. The same results were obtained for PR curves. As for the Node2vec-based and Mashup-based SVM models, the areas under the ROC or PR curves were relatively similar. However, the areas under the ROC or PR curves of the Mashup-based SVM model on some classes (e.g., C12) were relatively low, whereas they were improved by the Node2vec-based SVM model.
Therefore, the proposed model (Node2vec-based model) was better than the models using other drug features, indicating its superiority in classifying drugs.
3.4. Superiority of features derived from the drug network
In Section 3.3, the proposed model and two other models with different drug features were compared. The proposed model provided better performance than the two other models, suggesting that the features used in the proposed model were more efficient than those adopted in other models for drug classification. The t-SNE [51] was adopted to analyze three feature types used in the Node2vec-based, Mashup-based, and fingerprint-based models to further confirm these findings and provide intuitive evidence. Such tool could reduce the dimensions of features to two and display them in a 2D space, where samples in different classes are in different colors. The results of t-SNE are illustrated in Figure 6. The features derived from the drug network [Figure 6(A),(B)] could cluster drugs in different classes evidently better than the drug fingerprints [Figure 6(C)]. As for two feature types derived from the drug network, the features yielded by Node2vec [Figure 6(A)] could cluster drugs in different classes more compactly than those generated by Mashup. These results implied that the features derived from the drug network via Node2vec, which were used in the proposed model, were more helpful in classifying drugs. This finding was the main reason why the Node2vec-based model could provide better performance.
4.
Conclusions
In this study, a new drug classification system reported in KEGG was investigated by proposing classification models for assigning class in this system to any given drug. Informative drug features were derived from a large drug network via a powerful network embedding algorithm. The evaluation and comparison results indicated that the features were more related to the drug classification system in KEGG and more informative than classic fingerprint features. Employing more drug information (e.g., drug side effects and indications) or more complex and advanced feature learning algorithms (e.g., graph convolutional network and convolutional neural network) may be helpful to access more informative drug features, thereby building more powerful classification models. These aspects could be investigated in future work. Furthermore, the employment of SMOTE guaranteed that the performance on all classes was high. The model could be a useful tool to classify drugs, discover novel effects of existing drugs and determine the effects of candidate drug-like compounds.
Conflict of interest
The authors declare there is no conflict of interest.