
Diseases are a core issue for human health and have existed since the emergence of human beings. A large number of people die from various diseases every year, and how to treat different diseases is a perennial topic in medical science. Numerous efforts have been made over the years, especially in the past century, and many treatment schemes have been devised. Among them, drugs are considered one of the most effective options. However, developing a new drug is not easy; it requires rigorous and complex steps, making drug development a long and very expensive procedure. According to relevant reports, designing a new drug takes about 10-15 years on average [1] and can cost up to 802 million dollars [2]. Although investment in drug development has increased sharply in recent years, the number of drug approvals each year remains quite low. Designing new techniques to accelerate drug development therefore remains urgent.
Drug repositioning is regarded as an alternative pipeline that can accelerate drug development. Our knowledge of most existing drugs is incomplete, as some latent effects have not yet been discovered. The purpose of drug repositioning is to uncover these latent effects, thereby identifying new diseases that existing drugs can treat. Because numerous clinical tests have already been conducted on existing drugs, launching them for new indications can be markedly accelerated. However, it is still laborious to find and confirm the new effects of existing drugs. Computational methods offer an effective alternative and have become quite popular [3,4,5,6,7].
In recent years, many computational methods have been designed for drug repositioning. Most of them focus on the prediction of drug-disease associations (DDAs): the validated DDAs are analyzed in depth to discover special patterns that can be used to identify latent DDAs. Some previous studies have modeled the prediction problem as a recommender system [8,9,10,11,12]. These methods typically set up one or more kernels over drugs, diseases, or drug-disease pairs, and apply complex fusion procedures to these kernels to score each pair of drug and disease. Network-based methods form another group. They construct multiple networks containing not only drugs and diseases but also other related objects, such as long non-coding RNAs (lncRNAs), microRNAs (miRNAs), and target proteins. Some of these methods make predictions directly on the networks [13,14,15,16,17,18,19], whereas others use the networks to derive features of drugs and diseases and let downstream classification algorithms complete the prediction task [20,21,22,23,24,25]. Recently, deep learning algorithms have also been employed, such as convolutional neural networks [26] and graph convolutional networks [27,28]. Several computational methods build binary classifiers to predict DDAs, with validated DDAs retrieved from public databases serving as positive samples. The selection of negative samples, however, is a problem. Some studies randomly pick the same number of negative samples from unlabeled pairs of drugs and diseases [21,26,27]. Because latent DDAs may be included among them, such a selection can lead to an unstable decision boundary in the resulting classifiers. This study makes a contribution in this regard.
In this study, a novel reliable negative selection scheme, named RNSS, was proposed to select reliable negative samples (i.e., pairs of drugs and diseases with low probabilities of being actual DDAs). The scheme considers the k-neighbors of a drug in a drug network and evaluates the relationship between a drug and a disease based on the relationships between these k-neighbors and the disease. Three classic classification algorithms were adopted to construct classifiers on the negative samples generated by the RNSS: random forest (RF) [29], Bayes network (BN), and the nearest neighbor algorithm (NNA) [30]. The results indicated that classifiers with the RNSS provided a nearly perfect performance and were much better than those with traditional negative selection schemes, suggesting that the RNSS can genuinely screen out reliable negative samples.
The validated DDAs were directly obtained from a previous study [21]. These associations were extracted from the chemical-disease interactions collected in the Comparative Toxicogenomics Database (CTD) (http://ctdbase.org) [31,32,33]. In detail, the file "CTD_chemicals_diseases.csv.gz" was downloaded from the CTD, and the chemical-disease associations marked with "DirectEvidence" were extracted. Then, the associations without DrugBank IDs were discarded. In total, 63,472 associations remained, which were deemed positive samples in this study. They involved 2,794 drugs (represented by DrugBank IDs) and 3,019 diseases (represented by MESH or OMIM identifiers).
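The filtering step above can be sketched as follows. The miniature CSV content and the column layout are illustrative assumptions, not the real CTD file (which is gzipped and far larger), and the subsequent DrugBank-ID mapping step is omitted:

```python
import csv

# Hypothetical miniature of "CTD_chemicals_diseases.csv.gz" content; the
# column layout below is an assumption for illustration. Rows with an empty
# "DirectEvidence" field are inferred associations and are discarded.
raw = (
    "# ChemicalName,ChemicalID,CasRN,DiseaseName,DiseaseID,DirectEvidence\n"
    "Aspirin,C001,50-78-2,Asthma,MESH:D001249,therapeutic\n"
    "Aspirin,C001,50-78-2,Pain,MESH:D010146,\n"
    "Caffeine,C002,58-08-2,Headache,MESH:D006261,therapeutic\n"
)

def load_direct_associations(text):
    """Keep only chemical-disease pairs backed by direct evidence."""
    rows = csv.reader(line for line in text.splitlines()
                      if line and not line.startswith("#"))
    return [(chem_id, dis_id)
            for _, chem_id, _, _, dis_id, evidence in rows
            if evidence]

positives = load_direct_associations(raw)
print(positives)  # [('C001', 'MESH:D001249'), ('C002', 'MESH:D006261')]
```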
Generally, negative samples are necessary to build a binary classification model. However, their selection is challenging because unlabeled pairs of drugs and diseases may be latent DDAs. Some previous studies adopted random selection to construct negative samples. Here, we propose a scheme to select reliable negative samples. Based on this scheme, negative samples numbering one, two, and three times as many as the positive samples were selected and combined with the positive samples to constitute the datasets.
In this study, we tackled the prediction of DDAs by modeling it as a binary classification problem. The validated positive samples were obtained from the CTD as described in Section 2.1. However, no public database collects negative samples for DDAs, owing to their lack of application value [26]. Several previous studies adopted a random selection scheme to generate negative samples, which may induce an unstable decision boundary in the classifier [34]. Thus, an efficient scheme is needed to select reliable negative samples, that is, samples with a low likelihood of being actual DDAs. This section introduces such a scheme, named RNSS. Its procedures are described below.
First, a drug network was constructed, with the 2,794 drugs as nodes. The associations between drugs had to be determined to define the edges of the network. The Simplified Molecular Input Line Entry System (SMILES) [35] format is the most widely used representation of drugs, from which drug fingerprints can be extracted. Here, RDKit (http://www.rdkit.org/) was adopted to extract the extended-connectivity fingerprint (ECFP) of each investigated drug. For drug p, its fingerprints constitute a set, denoted by F(p). The Tanimoto coefficient was applied to the fingerprint sets of two drugs, p1 and p2, to measure their association, formulated by
$$ Q_f(p_1,p_2)=\frac{|F(p_1)\cap F(p_2)|}{|F(p_1)\cup F(p_2)|}. \tag{1} $$
Two nodes in the network were connected by an edge if and only if the association between their corresponding drugs was larger than zero. In addition, each edge e was assigned a weight, denoted by w(e), equal to the association between the two drugs it connects. The resulting network is denoted by Wd.
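Under Eq (1) and the edge rule above, the network construction can be sketched with toy fingerprint sets; the bit identifiers and drug names p1-p3 below are made up for illustration (in practice the sets would come from RDKit's ECFPs):

```python
from itertools import combinations

# Toy fingerprint sets standing in for ECFPs; the bit IDs are illustrative.
fingerprints = {
    "p1": {1, 2, 3, 4},
    "p2": {2, 3, 4, 5},
    "p3": {7, 8},
}

def tanimoto(fa, fb):
    """Eq (1): |intersection| / |union| of two fingerprint sets."""
    return len(fa & fb) / len(fa | fb)

# Build the weighted drug network Wd: an edge joins two drugs whenever
# their Tanimoto coefficient is larger than zero.
edges = {}
for a, b in combinations(fingerprints, 2):
    w = tanimoto(fingerprints[a], fingerprints[b])
    if w > 0:
        edges[(a, b)] = w

print(edges)  # {('p1', 'p2'): 0.6}
```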
It is known that drugs with similar structures are more likely to treat similar diseases. In view of this, the relationship between a drug pi and a disease qj can be measured through the relationships between qj and the drugs similar to pi. In Wd, these drugs are the direct neighbors of pi (i.e., its 1-neighbors). Furthermore, drugs at distance two from pi (its 2-neighbors) may also contribute, as may the k-neighbors of pi for k > 2. Accordingly, the k-neighbors of pi, denoted by Nk(pi), were extracted from Wd; this set consists of the drugs at distance k from pi. For each drug p in N1(pi), the association between p and pi (i.e., the weight of the edge connecting them) can be used directly to measure the relationship between pi and qj. However, drugs in Nk(pi) (k > 1) have no direct associations with pi. To settle this problem, the weight of a path must be defined. For a path P of length l, containing edges e1, e2, ..., el, its weight, denoted by w(P), was defined as
$$ w(P)=\frac{\prod_{i=1}^{l} w(e_i)}{F_{\mathrm{decay}}(P)}, \tag{2} $$
where w(ei) represents the weight of edge ei and Fdecay(P) is a decay function, which strengthens the penalty as the path grows longer, since a long path indicates a weak association between its two endpoints. It is computed by
$$ F_{\mathrm{decay}}(P)=\theta\cdot l, \tag{3} $$
where θ is a parameter, set to 2.26 as suggested in [36,37,38,39,40]. For each drug p in Nk(pi) (k > 1), its linkage to pi can be measured by the weights of the paths connecting them; if multiple paths connect them, the maximum path weight is selected. Based on the above definitions, the linkage between pi and a drug p in Nk(pi) can be synthesized as follows:
$$ L(p_i,p)=\begin{cases} Q_f(p_i,p), & p\in N_1(p_i)\\ \max\{w(P_i)\mid i=1,2,\cdots,m\}, & p\in N_k(p_i)\ (k>1), \end{cases} \tag{4} $$
where P1, P2, ..., Pm represent all paths of length k connecting pi and p. In Nk(pi), some drugs constitute DDAs with qj, whereas others do not. In view of this, we defined an indicator function as follows:
$$ \Delta(p,q_j)=\begin{cases} 1, & \text{if } p \text{ and } q_j \text{ constitute a DDA}\\ 0, & \text{otherwise.} \end{cases} \tag{5} $$
Then, the relationship between pi and qj can be measured by the following level score:
$$ \mathrm{Score}(p_i,q_j)=\sum_{k=1}^{t}\sum_{p\in N_k(p_i)} L(p_i,p)\cdot\Delta(p,q_j). \tag{6} $$
Because it is time-consuming to find all paths connecting two nodes at a long distance, a threshold t was employed in Eq (6). This setting is also reasonable because paths with long distances contribute little or nothing to the linkage between pi and qj. An example is shown in Figure 1, where the threshold t is set to 2.
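The level score with t = 2 can be sketched on a toy network. This is a minimal sketch, not the paper's implementation: the graph weights and the set of drugs treating disease q are made up, and the decay in Eqs (2)-(3) is read as a divisor θ·l that penalizes longer paths:

```python
THETA = 2.26  # decay parameter from the paper

# Toy weighted drug network (symmetric adjacency); values are illustrative.
graph = {
    "pi": {"a": 0.8, "b": 0.5},
    "a":  {"pi": 0.8, "c": 0.9},
    "b":  {"pi": 0.5, "c": 0.4},
    "c":  {"a": 0.9, "b": 0.4},
}
treats_q = {"a", "c"}  # drugs forming validated DDAs with disease q

def path_weight(edge_weights):
    """Eqs (2)-(3): product of edge weights penalized by theta * length."""
    prod = 1.0
    for w in edge_weights:
        prod *= w
    return prod / (THETA * len(edge_weights))

def level_score(pi, t=2):
    """Eq (6) with t = 2: direct neighbors plus 2-neighbors."""
    score = 0.0
    one_hop = graph[pi]
    for p, w in one_hop.items():            # k = 1: linkage = Qf(pi, p)
        if p in treats_q:
            score += w
    best = {}                               # k = 2: best length-2 path weight
    for mid, w1 in one_hop.items():
        for p, w2 in graph[mid].items():
            if p == pi or p in one_hop:
                continue
            best[p] = max(best.get(p, 0.0), path_weight([w1, w2]))
    for p, w in best.items():
        if p in treats_q:
            score += w
    return score

print(round(level_score("pi"), 4))  # 0.9593
```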
Clearly, a high outcome of Eq (6) indicates a strong relationship between the drug and the disease. Conversely, a pair of drug and disease with a low score is very unlikely to be an actual DDA. Such pairs can serve as high-quality negative samples and may help construct classifiers with high performance. Accordingly, unlabeled pairs of drugs and diseases with low level scores should be selected as negative samples. This can be done by setting a low threshold s on the level score (i.e., unlabeled pairs with level scores no more than s were selected). The selected negative samples comprised a negative sample pool, denoted by NS(s).
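Given level scores for unlabeled pairs, building NS(s) is a simple filter; the pair names and score values below are illustrative, not computed from real data:

```python
# Level scores for unlabeled drug-disease pairs (illustrative values);
# in the paper these would come from Eq (6) with t = 2.
scores = {
    ("d1", "q1"): 0.00,
    ("d1", "q2"): 0.04,
    ("d2", "q1"): 0.30,
    ("d2", "q3"): 0.09,
}

def negative_pool(scores, s):
    """NS(s): unlabeled pairs whose level score is no more than s."""
    return {pair for pair, v in scores.items() if v <= s}

print(sorted(negative_pool(scores, 0.05)))  # [('d1', 'q1'), ('d1', 'q2')]
print(sorted(negative_pool(scores, 0.1)))   # adds ('d2', 'q3')
```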
In traditional machine learning, it is very important to encode each sample with its essential properties. In this study, we directly adopted the drug and disease features reported in our previous study [21], which were derived from multiple networks. Networks have recently become a popular research form, as they characterize each object against the background of all other objects. Since the procedures of network construction and feature extraction were described in detail in the previous study [21], we give only a brief introduction here.
Twelve drug networks were constructed: eight contained only drugs, and the other four included additional objects. The former eight networks were built from drug associations collected from two public databases, KEGG (https://www.genome.jp/kegg/) [41,42] and STITCH (http://stitch.embl.de, version 4.0) [43], and from associations defined by the Anatomical Therapeutic Chemical (ATC) codes of two drugs. The latter four networks captured the relationships between drugs and one of the following objects: proteins, pathways, side effects, and gene ontology (GO) terms. The target proteins and side effects of drugs were retrieved from DrugBank (https://go.drugbank.com/) [44] and SIDER (http://sideeffects.embl.de/) [45], respectively. The GO terms and pathways related to chemicals were collected from the CTD and used to build the corresponding drug networks.
Three networks were built for diseases, based on pathway, gene, and phenotype information, all sourced from the CTD. Given the related pathways of two diseases, their association was measured by the Tanimoto coefficient of the two pathway sets; the disease associations based on gene and phenotype information were assessed in the same way. Three networks were then built from these disease associations.
The drug and disease networks mentioned above contain abundant information on drugs and diseases, respectively, from which informative features can be extracted. As multiple networks were constructed for drugs and diseases, a powerful network embedding algorithm, Mashup [46], was adopted; its greatest merit is that it can process more than one network. The algorithm comprises two stages. In the first stage, a raw feature vector for each node in each network is extracted via a random walk with restart [47,48]. In the second stage, the feature vectors of the same node derived from different networks are fused, and the dimension is reduced at the same time. Mashup was applied to the twelve drug networks to generate drug features, and the disease features were produced from the three disease networks. Various dimensions, ranging from 50 to 1000, were produced for drugs and diseases. The optimal dimensions were determined by ten-fold cross-validation [49].
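The first Mashup stage can be illustrated on a tiny network. This is a minimal sketch only: the 3-node adjacency matrix and restart probability are illustrative, not the paper's settings, and the real algorithm then fuses these profiles across networks with dimensionality reduction:

```python
import numpy as np

# A 3-node toy network: node 0 is linked to nodes 1 and 2.
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
W = A / A.sum(axis=0, keepdims=True)  # column-normalized transition matrix
restart = 0.5                         # illustrative restart probability

def rwr(seed, tol=1e-10):
    """Random walk with restart: iterate p = (1-r)*W*p + r*e to a fixpoint."""
    e = np.zeros(3)
    e[seed] = 1.0
    p = e.copy()
    while True:
        nxt = (1 - restart) * W @ p + restart * e
        if np.abs(nxt - p).max() < tol:
            return nxt
        p = nxt

profile = rwr(0)  # stationary visiting profile = raw feature of node 0
print(profile.round(3))  # [0.667 0.167 0.167]
```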
In this study, the RNSS was proposed to select reliable negative samples. Three classic classification algorithms were selected to construct models from the validated positive samples and the selected reliable negative samples, thereby elaborating the utility of the RNSS. These algorithms, RF [29], BN, and NNA [30], were also adopted in the previous study [21]. They are designed on quite different ideas and principles, so their common results can carry universal significance. If, for each of these three algorithms, the classifiers with the RNSS were generally better than those without the RNSS or with other negative sample selection schemes, this would demonstrate that the RNSS is an efficient scheme for selecting high-quality negative samples. To quickly implement the above algorithms, the corresponding tools (RandomForest, BayesNet, and IBk) in Weka [50] were employed. Their default parameters were adopted because the purpose of this study was to test whether the RNSS can improve model performance rather than to build models with excellent performance.
In this study, ten-fold cross-validation was adopted to evaluate the performance of all constructed classifiers [49]. This method divides the samples into ten parts; each part is singled out in turn as the test set, and the remaining parts constitute the training set. The classifier built on the training set is then applied to the test set. In the RNSS, the calculation of the level score depends on the positive samples in the training set. Thus, we first divided the positive samples into ten parts. When one part of the positive samples was singled out and put into the test set, the remaining positive samples were used to compute the level scores of unlabeled samples and to select negative samples. In this way, information from the test samples was completely excluded when training the classifiers, making the cross-validation rigorous.
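The fold protocol can be sketched as below; `score_with` and `select_negatives` are hypothetical placeholders for recomputing Eq (6) on the training positives and filtering the pool, and the sample names are made up:

```python
import random

# Sketch of the rigorous protocol: positives are split into ten folds
# first, and negatives for each round are drawn using only the
# training-fold positives, so no test information leaks into selection.
random.seed(0)
positives = [f"pos{i}" for i in range(20)]  # toy positive samples
random.shuffle(positives)
folds = [positives[i::10] for i in range(10)]

for test_pos in folds:
    train_pos = [p for fold in folds if fold is not test_pos for p in fold]
    # negatives = select_negatives(score_with(train_pos), s=0.05)
    # ... train classifier on train_pos + negatives, evaluate on test_pos
    assert not set(test_pos) & set(train_pos)  # folds never overlap
```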
For binary classification, many measurements have been designed to evaluate model performance. The direct way to display the predicted results of a model is a confusion matrix, which contains four entries: true positive (TP), false negative (FN), false positive (FP), and true negative (TN). Several measurements can be calculated from these entries. In this study, we selected sensitivity (SN), specificity (SP), accuracy (ACC), precision, F1-measure [51,52,53,54,55,56,57], and the Matthews correlation coefficient (MCC) [58], which can be computed by
$$ \mathrm{SN}=\frac{TP}{TP+FN} \tag{7} $$
$$ \mathrm{SP}=\frac{TN}{TN+FP} \tag{8} $$
$$ \mathrm{ACC}=\frac{TP+TN}{TP+FP+TN+FN} \tag{9} $$
$$ \mathrm{Precision}=\frac{TP}{TP+FP} \tag{10} $$
$$ \mathrm{F1\text{-}measure}=\frac{2\times \mathrm{Recall}\times \mathrm{Precision}}{\mathrm{Recall}+\mathrm{Precision}} \tag{11} $$
$$ \mathrm{MCC}=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FP)\times(TP+FN)\times(TN+FP)\times(TN+FN)}} \tag{12} $$
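Eqs (7)-(12) translate directly into code; the confusion-matrix counts below are arbitrary illustrative numbers:

```python
import math

def metrics(tp, fn, fp, tn):
    """Eqs (7)-(12): measurements derived from the confusion matrix."""
    sn = tp / (tp + fn)                      # sensitivity (recall)
    sp = tn / (tn + fp)                      # specificity
    acc = (tp + tn) / (tp + fp + tn + fn)    # accuracy
    prec = tp / (tp + fp)                    # precision
    f1 = 2 * sn * prec / (sn + prec)         # F1-measure
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return sn, sp, acc, prec, f1, mcc

sn, sp, acc, prec, f1, mcc = metrics(tp=90, fn=10, fp=20, tn=80)
print(round(sn, 3), round(sp, 3), round(mcc, 3))  # 0.9 0.8 0.704
```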
In addition, the receiver operating characteristic (ROC) and precision-recall (PR) curves were used to comprehensively evaluate model performance. The ROC curve plots SN on the Y-axis against the false positive rate (i.e., 1-SP) on the X-axis, obtained by varying the classification threshold. The PR curve is defined similarly, with recall (i.e., SN) on the X-axis and precision on the Y-axis. The areas under the ROC and PR curves, denoted by AUROC and AUPR, respectively, are essential measurements for assessing model performance.
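AUROC can also be computed without plotting, from its probabilistic interpretation: the fraction of positive-negative score pairs ranked correctly, with ties counted one half. The scores below are illustrative:

```python
# Illustrative classifier scores for positive and negative samples.
pos_scores = [0.9, 0.8, 0.4]
neg_scores = [0.7, 0.3, 0.2]

def auroc(pos, neg):
    """Probability a random positive outscores a random negative."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auroc(pos_scores, neg_scores))  # 0.8888888888888888
```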
In this study, a computational method was proposed to identify DDAs. To enhance its performance, a novel negative sample selection scheme, RNSS, was designed. Classifiers using samples selected by the RNSS were built and evaluated. The entire procedure is illustrated in Figure 2.
When assessing the level score of one drug and one disease, the threshold t in Eq (6) determines which k-neighbors of the drug are considered. Clearly, neighbors at a long distance from the drug contribute little. Here, we set t to two; that is, the direct neighbors and 2-neighbors of the drug were included to assess its associations with diseases. After obtaining the level scores of all unlabeled pairs using Eq (6) with t = 2, we set two thresholds (0.05 and 0.1) to construct two negative sample pools (i.e., NS(0.05) and NS(0.1)). From each pool, the same number of negative samples as positive samples were randomly selected and combined with the positive samples to constitute a balanced dataset. Each sample was represented by drug and disease features derived from multiple drug and disease networks. The dimensions of the drug and disease features were set to various values between 50 and 1,000, and all possible dimension combinations were attempted. Three classification algorithms (RF, BN, and NNA) were applied to construct classifiers on the balanced datasets, which were assessed by ten-fold cross-validation. The best performance (measured by MCC) of the RF, BN, and NNA classifiers using negative samples selected from the two pools is listed in Table 1.
| Classification algorithm | Negative sample pool | Drug feature dimension | Disease feature dimension | SN | SP | ACC | Precision | F1-measure | MCC |
|---|---|---|---|---|---|---|---|---|---|
| Random forest | NS(0.05) | 1000 | 1000 | 1.000 | 0.979 | 0.989 | 0.979 | 0.989 | 0.979 |
| Random forest | NS(0.1) | 850 | 900 | 1.000 | 0.953 | 0.975 | 0.950 | 0.974 | 0.951 |
| Bayes network | NS(0.05) | 550 | 50 | 1.000 | 0.979 | 0.989 | 0.979 | 0.989 | 0.979 |
| Bayes network | NS(0.1) | 550 | 50 | 1.000 | 0.952 | 0.975 | 0.950 | 0.974 | 0.950 |
| Nearest neighbor algorithm | NS(0.05) | 1000 | 50 | 0.978 | 0.974 | 0.976 | 0.974 | 0.976 | 0.951 |
| Nearest neighbor algorithm | NS(0.1) | 1000 | 50 | 0.956 | 0.954 | 0.955 | 0.954 | 0.955 | 0.910 |
For the balanced dataset with negative samples selected from NS(0.05), the RF classifier yielded a very high MCC of 0.979. The other five measurements (SN, SP, ACC, precision, and F1-measure) were 1.000, 0.979, 0.989, 0.979, and 0.989, respectively. Such performance suggests that the RF classifier gave a nearly perfect prediction. The BN and NNA classifiers also performed well, with MCC values of 0.979 and 0.951, respectively. Considering that BN and NNA are not very powerful classification algorithms, such performance was extremely high for them. The ROC and PR curves of the above three classifiers are illustrated in Figure 3. The AUROC values for the three classifiers were 0.9962, 0.9893, and 0.9758, respectively, and the AUPR values were 0.9969, 0.9893, and 0.9653, respectively. All were very high, further confirming the strong performance of the three classifiers. These results imply that the negative samples selected from NS(0.05) were quite different from the positive samples, making classification easy, which in turn supports the effectiveness of the RNSS.
For the other pool, NS(0.1), negative samples were also randomly selected and combined with the positive samples into a balanced dataset. The MCC of the RF classifier on this dataset was 0.951, lower than that of the RF classifier using samples selected from NS(0.05). The SP, ACC, precision, and F1-measure also slightly decreased compared with those of the RF classifier using samples from NS(0.05) (Table 1). The ROC and PR curves, as well as the AUROC and AUPR (Figure 3), of the RF classifier on this dataset were likewise inferior. A similar pattern was observed for the BN and NNA classifiers. This suggests that the level score (Eq (6)) genuinely indicates the quality of negative samples: low level scores indicate high-quality negative samples, so selecting negative samples with low level scores is a proper strategy.
In Section 3.1, all classifiers were built on balanced datasets. As a further test, classifiers based on imbalanced datasets were constructed and evaluated. We randomly selected negative samples from each negative sample pool numbering twice or thrice as many as the positive samples to comprise imbalanced datasets. On each imbalanced dataset, classifiers with different combinations of drug and disease feature dimensions were built and evaluated by ten-fold cross-validation. The best performance for the three classification algorithms is listed in Tables 2 and 3.
| Classification algorithm | Negative sample pool | Drug feature dimension | Disease feature dimension | SN | SP | ACC | Precision | F1-measure | MCC |
|---|---|---|---|---|---|---|---|---|---|
| Random forest | NS(0.05) | 1000 | 1000 | 1.000 | 0.989 | 0.993 | 0.979 | 0.989 | 0.984 |
| Random forest | NS(0.1) | 850 | 900 | 0.999 | 0.975 | 0.983 | 0.950 | 0.974 | 0.962 |
| Bayes network | NS(0.05) | 550 | 50 | 1.000 | 0.989 | 0.993 | 0.979 | 0.989 | 0.984 |
| Bayes network | NS(0.1) | 550 | 50 | 1.000 | 0.975 | 0.983 | 0.950 | 0.974 | 0.963 |
| Nearest neighbor algorithm | NS(0.05) | 1000 | 50 | 0.978 | 0.985 | 0.983 | 0.971 | 0.974 | 0.962 |
| Nearest neighbor algorithm | NS(0.1) | 1000 | 50 | 0.953 | 0.973 | 0.967 | 0.946 | 0.950 | 0.925 |
| Classification algorithm | Negative sample pool | Drug feature dimension | Disease feature dimension | SN | SP | ACC | Precision | F1-measure | MCC |
|---|---|---|---|---|---|---|---|---|---|
| Random forest | NS(0.05) | 1000 | 1000 | 0.999 | 0.992 | 0.994 | 0.979 | 0.989 | 0.984 |
| Random forest | NS(0.1) | 850 | 900 | 0.999 | 0.984 | 0.987 | 0.950 | 0.974 | 0.966 |
| Bayes network | NS(0.05) | 550 | 50 | 1.000 | 0.992 | 0.994 | 0.979 | 0.989 | 0.985 |
| Bayes network | NS(0.1) | 550 | 50 | 1.000 | 0.984 | 0.987 | 0.950 | 0.974 | 0.966 |
| Nearest neighbor algorithm | NS(0.05) | 1000 | 50 | 0.976 | 0.988 | 0.985 | 0.970 | 0.973 | 0.962 |
| Nearest neighbor algorithm | NS(0.1) | 1000 | 50 | 0.951 | 0.982 | 0.974 | 0.945 | 0.948 | 0.931 |
For the imbalanced datasets containing twice as many negative samples as positive samples, the RF classifier generated MCC values of 0.984 and 0.962 for the two negative sample pools (Table 2). The corresponding values for the BN classifier were 0.984 and 0.963 (Table 2), and the NNA classifier yielded MCC values of 0.962 and 0.925 (Table 2). Remarkably, these three classifiers on the imbalanced datasets performed even better than those on the balanced datasets described in Section 3.1. Generally, the performance of classifiers on imbalanced datasets may decrease, especially on the minor class; however, the overall performance of the above classifiers actually increased, and the decrease on the minor class (positive samples), measured by SN, was quite limited. The ROC and PR curves of the above classifiers are illustrated in Figure 4(A), (B). In terms of AUROC and AUPR, the performance of the classifiers on imbalanced datasets did not clearly decrease, consistent with the results assessed by the other measurements.
For the imbalanced datasets containing thrice as many negative samples as positive samples, the performance of the three classifiers on the two pools is shown in Table 3. This performance was quite similar to that in Tables 1 and 2, and the same conclusions can be drawn from the ROC and PR curves of these classifiers (Figure 4(C), (D)). These results indicate that classifiers with the RNSS were not very sensitive to the imbalance problem. As the negative sample pools constructed by the RNSS contained high-quality negative samples, increasing the number of negative samples did not amplify the difficulty of learning an efficient classifier. Furthermore, classifiers using negative samples from NS(0.05) were superior to those using negative samples from NS(0.1), suggesting that the level score is a good indicator for selecting negative samples of higher quality.
In the RNSS, the distance limitation t is an important parameter that directly influences the calculation of the level score (see Eq (6)), so it was interesting to investigate its effect. The above classifiers were all based on negative samples selected by the RNSS with t = 2. Here, we investigated classifiers using negative samples selected by the RNSS with t = 1. In fact, the RNSS with t = 1 is equivalent to a previously reported negative sample selection method named finding reliable negative samples (FIRE) [59].
When t was set to 1, numerous unlabeled samples were assigned a level score of 0. Thus, we added the threshold 0 for the parameter s, giving three negative sample pools: NS(0), NS(0.05), and NS(0.1). From each pool, we first randomly selected as many negative samples as positive samples to constitute balanced datasets. Classifiers with the different parameter combinations mentioned above were constructed on each balanced dataset and evaluated by ten-fold cross-validation. The best performance for each classification algorithm was picked out, and the detailed measurements listed in Section 2.6 are illustrated in Figure 5. For easy comparison, the performance of classifiers with the RNSS (t = 2) is also shown in this figure. Given the same classification algorithm (RF, BN, or NNA), classifiers with the RNSS (t = 2) were generally superior to those with the RNSS (t = 1) in terms of all measurements. The improvement for the BN and NNA classifiers was substantial, whereas that for the RF classifier was slight; because the RF classifiers with the RNSS (t = 1) already provided relatively high performance, there was little room for improvement.
In addition, imbalanced datasets were constructed by selecting two and three times as many negative samples as positive samples from each of the three pools. Classifiers built on these datasets were also evaluated by ten-fold cross-validation. The best performance for each classification algorithm is shown in Appendix Figures A1 and A2, from which the same conclusion can be drawn (i.e., the classifiers with the RNSS (t = 2) were clearly better than those with the RNSS (t = 1)).
Taken together, on both balanced and imbalanced datasets, classifiers with the RNSS (t = 2) yielded better performance, indicating that employing the 2-neighbors of drugs improves the selection of negative samples. As mentioned above, numerous unlabeled samples were assigned a level score of zero by the RNSS (t = 1): the drug in each of these samples has no direct neighbors related to the disease in the same sample. However, such a drug may have 2-neighbors associated with the disease, which makes the level score yielded by the RNSS (t = 2) larger than zero. Employing 2-neighbors thus helps to further discriminate these unlabeled samples, thereby screening out negative samples of higher quality.
In many studies, random selection of negative samples is a widely used scheme for constructing binary classifiers [21,26,27,60,61]. Here, several classifiers were built using this scheme and compared with classifiers with an RNSS to demonstrate the superiority of the RNSS. To give a full comparison, we constructed three datasets containing one, two, and three times as many negative samples as positive samples. Three classification algorithms (RF, BN and NNA) were adopted to build the classifiers. The feature dimensions of drugs and diseases were the same as for classifiers with an RNSS (t = 2, s = 0.05), as listed in Tables 1-3. All classifiers were also assessed by ten-fold cross-validation. The performance is listed in Table 4.
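The random-selection baseline amounts to drawing unlabeled pairs uniformly at random. A minimal sketch, with an assumed function name and data layout:

```python
import random

def random_negatives(unlabeled_pairs, n_pos, ratio=1, seed=0):
    """Draw ratio * n_pos unlabeled pairs uniformly at random as negatives."""
    rng = random.Random(seed)
    return rng.sample(unlabeled_pairs, ratio * n_pos)

# e.g., building a 1:2 dataset from a toy pool of unlabeled pairs
unlabeled = [(f"drug{i}", f"disease{j}") for i in range(50) for j in range(50)]
negatives = random_negatives(unlabeled, n_pos=100, ratio=2)
print(len(negatives))  # 200
```

Unlike the RNSS, this baseline ignores the drug network entirely, so some sampled "negatives" may in fact be unobserved true associations.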
Table 4. Performance of classifiers using randomly selected negative samples.

| Classification algorithm | Ratio of positive and negative samples | SN | SP | ACC | Precision | F1-measure | MCC | AUROC | AUPR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random forest | 1:1 | 0.739 | 0.782 | 0.761 | 0.773 | 0.756 | 0.522 | 0.818 | 0.803 |
| | 1:2 | 0.570 | 0.884 | 0.779 | 0.711 | 0.632 | 0.483 | 0.818 | 0.690 |
| | 1:3 | 0.435 | 0.925 | 0.803 | 0.659 | 0.524 | 0.420 | 0.807 | 0.591 |
| Bayes network | 1:1 | 0.667 | 0.837 | 0.752 | 0.804 | 0.729 | 0.512 | 0.751 | 0.703 |
| | 1:2 | 0.673 | 0.836 | 0.782 | 0.672 | 0.672 | 0.509 | 0.751 | 0.558 |
| | 1:3 | 0.675 | 0.826 | 0.788 | 0.564 | 0.614 | 0.473 | 0.747 | 0.460 |
| Nearest neighbor algorithm | 1:1 | 0.695 | 0.685 | 0.690 | 0.688 | 0.691 | 0.380 | 0.671 | 0.616 |
| | 1:2 | 0.571 | 0.780 | 0.711 | 0.565 | 0.568 | 0.351 | 0.660 | 0.451 |
| | 1:3 | 0.497 | 0.826 | 0.744 | 0.487 | 0.492 | 0.321 | 0.638 | 0.346 |
For the balanced dataset (i.e., a 1:1 ratio of positive to negative samples), the MCC values of the RF, BN, and NNA classifiers were only 0.522, 0.512, and 0.380, respectively, much lower than the 0.979, 0.979, and 0.951 achieved by the corresponding classifiers with an RNSS (t = 2, s = 0.05) (Table 1). The same pattern holds for the other measurements. Thus, classifiers with an RNSS were much stronger than those using randomly selected negative samples. For the imbalanced datasets (positive-to-negative ratios of 1:2 and 1:3), classifiers using randomly selected negative samples were likewise much inferior to those with an RNSS (see Tables 2-4). These results indicate that the RNSS effectively selects high-quality negative samples, thereby improving the classifiers.
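The measurements compared throughout (SN, SP, ACC, precision, F1-measure, MCC) follow their standard definitions from the confusion matrix. A minimal sketch, using toy counts since the raw counts behind the tables are not reported:

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Standard binary-classification measurements from confusion counts."""
    sn = tp / (tp + fn)                       # sensitivity (recall)
    sp = tn / (tn + fp)                       # specificity
    acc = (tp + tn) / (tp + fn + tn + fp)
    prec = tp / (tp + fp)
    f1 = 2 * prec * sn / (prec + sn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"SN": sn, "SP": sp, "ACC": acc,
            "Precision": prec, "F1-measure": f1, "MCC": mcc}

# On imbalanced data, ACC can stay high while MCC drops sharply, which is
# why MCC is emphasized in the comparisons above (toy counts, 1:3 ratio).
print(binary_metrics(tp=44, fn=56, tn=278, fp=22))
```

This illustrates the pattern visible in Table 4: as the negative class grows, specificity and accuracy can rise even while sensitivity, F1-measure, and MCC deteriorate.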
Besides random selection of negative samples, some studies adopted another scheme [62,63]: unlabeled samples are clustered by K-means, and negative samples are randomly selected from each cluster in equal numbers. As stated in [64], the clustering effect was best when the unlabeled samples were clustered into 23 clusters, so we adopted the same setting (i.e., unlabeled drug-disease pairs were clustered into 23 clusters). Unlabeled pairs selected from each cluster were combined to constitute the negative sample set, whose size was the same as the positive sample set. Three classification algorithms (RF, BN, and NNA) were adopted to build the classifiers. The feature dimensions of drugs and diseases were the same as for classifiers with an RNSS (t = 2, s = 0.05), as listed in Table 1. The evaluation results yielded by ten-fold cross-validation are listed in Table 5. Compared with the classifiers with an RNSS (Table 1), these classifiers performed very poorly, further confirming the utility of the RNSS.
Table 5. Performance of classifiers using negative samples selected via K-means clustering.

| Classification algorithm | Ratio of positive and negative samples | SN | SP | ACC | Precision | F1-measure | MCC | AUROC | AUPR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random forest | 1:1 | 0.053 | 0.867 | 0.460 | 0.271 | 0.088 | -0.142 | 0.600 | 0.510 |
| Bayes network | 1:1 | 0.662 | 0.809 | 0.736 | 0.776 | 0.714 | 0.477 | 0.734 | 0.684 |
| Nearest neighbor algorithm | 1:1 | 0.732 | 0.757 | 0.745 | 0.751 | 0.741 | 0.490 | 0.722 | 0.669 |
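The cluster-based selection scheme described above can be sketched as follows. To keep the sketch dependency-light, it uses a minimal hand-rolled Lloyd's k-means over toy feature vectors; in practice a library implementation would be used, and the function names and feature layout here are illustrative assumptions.

```python
import numpy as np

def kmeans_labels(X, k, iters=50, seed=0):
    """Minimal Lloyd's k-means; returns a cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center, then recompute centers
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def cluster_sample_negatives(X, n_total, k=23, seed=0):
    """Pick about n_total / k unlabeled samples from each of k clusters."""
    rng = np.random.default_rng(seed)
    labels = kmeans_labels(X, k, seed=seed)
    per_cluster = max(1, n_total // k)
    chosen = []
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        take = min(per_cluster, len(idx))
        if take:
            chosen.extend(rng.choice(idx, size=take, replace=False).tolist())
    return chosen[:n_total]

X = np.random.default_rng(1).normal(size=(300, 6))  # toy unlabeled features
negative_idx = cluster_sample_negatives(X, n_total=100, k=23, seed=1)
```

Sampling evenly across clusters spreads the negatives over the unlabeled feature space, but unlike the RNSS it uses no association knowledge, which is consistent with the weak results in Table 5.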
To date, several DDA prediction methods have been proposed. Some of them were selected for comparison with our method with the RNSS; their performance is listed in Table 6. For easy comparison, the performance of our method (RF classifier with s = 0.05 and t = 2) is also provided in this table. Our method provided the best performance on all eight measurements, indicating the superiority of our model and the RNSS.
Table 6. Comparison of our method with previous DDA prediction methods.

| Model | SN | SP | ACC | Precision | F1-measure | MCC | AUROC | AUPR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Our model | 1.000 | 0.979 | 0.989 | 0.979 | 0.989 | 0.979 | 0.996 | 0.997 |
| RepCOOL [24] | 0.930 | - | - | 0.530 | 0.670 | - | 0.670 | - |
| RLFDDA [25] | 0.897 | - | 0.901 | 0.904 | 0.900 | - | 0.964 | - |
| Li et al.'s method [26] | 0.862 | 0.868 | 0.865 | 0.867 | - | 0.730 | 0.936 | 0.935 |
| MGP-DDA [23] | 0.842 | - | 0.867 | 0.886 | 0.863 | - | 0.930 | 0.944 |
| Yang and Chen's method [21] | 0.872 | 0.843 | 0.858 | 0.847 | 0.860 | 0.716 | 0.928 | 0.919 |

Note: Measurements for all methods except ours were taken directly from the corresponding literature; "-" indicates the measurement was not reported.
The selection of negative samples is a challenging problem in association prediction, and several schemes have been designed in recent years [59,65,66]. The proposed RNSS is more similar to the methods in [59,65]; thus, this section focuses on its similarities to and differences from the method in [66], called the self-paced negative sampling strategy (SNSS).
SNSS employs a hardness function to indicate the likelihood that an unlabeled sample is an actual negative sample, similar to the level score in our scheme. This function relies on a multilayer perceptron (MLP) classifier, which is very different from the drug network in our scheme: it is defined as the difference between the probability score yielded by the MLP and the ground-truth label. It is hard to say which scoring system is better, as there is no generally accepted dataset for validating whether such scores are correct. On the other hand, the selection strategies of the two methods are quite different. SNSS divides unlabeled samples into several categories according to the hardness function and selects negative samples from each category in different proportions; its selection scope is all unlabeled samples. Such a selection can fully train the classifier and increase its robustness. In our scheme, we select negative samples from a pool consisting of unlabeled samples with low level scores (i.e., the selection scope is only a part of the unlabeled samples), which can improve the performance of the classifier. The different intentions induce different selection strategies. Both methods provide alternative ways to select negative samples when building binary association prediction methods, and it would be interesting to fuse their merits to design a new negative sample selection method.
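A hedged sketch of the SNSS idea just described: an auxiliary classifier's probability for an unlabeled pair defines a hardness score, samples are binned by hardness, and negatives are drawn from every bin in different proportions. The bin edges, weights, and function names here are illustrative assumptions, not the exact formulation of [66].

```python
import random

def hardness(prob, label=0.0):
    """|predicted probability - assumed label|; low values = easy negatives."""
    return abs(prob - label)

def snss_select(unlabeled, probs, n_total,
                bins=(0.25, 0.5, 0.75, 1.0),
                weights=(0.4, 0.3, 0.2, 0.1), seed=0):
    """Group unlabeled samples by hardness bin, then sample each bin by weight."""
    rng = random.Random(seed)
    groups = [[] for _ in bins]
    for sample, p in zip(unlabeled, probs):
        h = hardness(p)
        for i, edge in enumerate(bins):
            if h <= edge:
                groups[i].append(sample)
                break
    chosen = []
    for grp, w in zip(groups, weights):
        k = min(len(grp), round(n_total * w))
        chosen.extend(rng.sample(grp, k))
    return chosen
```

The contrast with the RNSS is visible in the last loop: SNSS draws from every hardness bin (all unlabeled samples), whereas the RNSS restricts sampling to the low-level-score pool.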
In this study, we proposed a novel scheme for selecting high-quality negative drug-disease samples. To demonstrate its utility, several classifiers were constructed with negative samples selected by this scheme. The evaluation results suggested that these classifiers had an extremely strong ability to identify DDAs. Additionally, these classifiers were much better than those using traditional and previous schemes, confirming the positive effect of the proposed scheme in selecting high-quality negative samples. It is hoped that this scheme can be applied to other related problems to build classifiers with high performance.
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
The authors declare there is no conflict of interest.
Table 1. Best performance of classifiers with an RNSS (t = 2) on balanced datasets (1:1 ratio of positive to negative samples).

| Classification algorithm | Negative sample pool | Drug feature dimension | Disease feature dimension | SN | SP | ACC | Precision | F1-measure | MCC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random forest | NS(0.05) | 1000 | 1000 | 1.000 | 0.979 | 0.989 | 0.979 | 0.989 | 0.979 |
| | NS(0.1) | 850 | 900 | 1.000 | 0.953 | 0.975 | 0.950 | 0.974 | 0.951 |
| Bayes network | NS(0.05) | 550 | 50 | 1.000 | 0.979 | 0.989 | 0.979 | 0.989 | 0.979 |
| | NS(0.1) | 550 | 50 | 1.000 | 0.952 | 0.975 | 0.950 | 0.974 | 0.950 |
| Nearest neighbor algorithm | NS(0.05) | 1000 | 50 | 0.978 | 0.974 | 0.976 | 0.974 | 0.976 | 0.951 |
| | NS(0.1) | 1000 | 50 | 0.956 | 0.954 | 0.955 | 0.954 | 0.955 | 0.910 |

Table 2. Best performance of classifiers with an RNSS (t = 2) on imbalanced datasets (1:2 ratio of positive to negative samples).

| Classification algorithm | Negative sample pool | Drug feature dimension | Disease feature dimension | SN | SP | ACC | Precision | F1-measure | MCC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random forest | NS(0.05) | 1000 | 1000 | 1.000 | 0.989 | 0.993 | 0.979 | 0.989 | 0.984 |
| | NS(0.1) | 850 | 900 | 0.999 | 0.975 | 0.983 | 0.950 | 0.974 | 0.962 |
| Bayes network | NS(0.05) | 550 | 50 | 1.000 | 0.989 | 0.993 | 0.979 | 0.989 | 0.984 |
| | NS(0.1) | 550 | 50 | 1.000 | 0.975 | 0.983 | 0.950 | 0.974 | 0.963 |
| Nearest neighbor algorithm | NS(0.05) | 1000 | 50 | 0.978 | 0.985 | 0.983 | 0.971 | 0.974 | 0.962 |
| | NS(0.1) | 1000 | 50 | 0.953 | 0.973 | 0.967 | 0.946 | 0.950 | 0.925 |

Table 3. Best performance of classifiers with an RNSS (t = 2) on imbalanced datasets (1:3 ratio of positive to negative samples).

| Classification algorithm | Negative sample pool | Drug feature dimension | Disease feature dimension | SN | SP | ACC | Precision | F1-measure | MCC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random forest | NS(0.05) | 1000 | 1000 | 0.999 | 0.992 | 0.994 | 0.979 | 0.989 | 0.984 |
| | NS(0.1) | 850 | 900 | 0.999 | 0.984 | 0.987 | 0.950 | 0.974 | 0.966 |
| Bayes network | NS(0.05) | 550 | 50 | 1.000 | 0.992 | 0.994 | 0.979 | 0.989 | 0.985 |
| | NS(0.1) | 550 | 50 | 1.000 | 0.984 | 0.987 | 0.950 | 0.974 | 0.966 |
| Nearest neighbor algorithm | NS(0.05) | 1000 | 50 | 0.976 | 0.988 | 0.985 | 0.970 | 0.973 | 0.962 |
| | NS(0.1) | 1000 | 50 | 0.951 | 0.982 | 0.974 | 0.945 | 0.948 | 0.931 |