1. Introduction
With the rapid development of DNA microarray technology, it has become possible to monitor gene activity from multiple aspects through gene expression data. As gene expression reflects human health, such data are potentially helpful for disease identification, prevention and treatment. However, it remains a challenging task to extract valuable information from gene expression data. One of the main reasons is that gene expression data consist of thousands of dimensions, whereas only a small part is instructive [1]. Feature selection and feature dimensionality reduction are two representative families of methods for selecting instructive features from gene expression data. In the former, a subset of features is selected from the gene expression data; Filter [2,3,4,5,6,7,8,9], Wrapper [10,11,12,13,14,15] and Embedded [16] are representative methods. Correspondingly, feature dimensionality reduction methods map the features from a high-dimensional space to a low-dimensional space, where the feature values generally change during the mapping. Representative feature dimensionality reduction methods include principal component analysis (PCA) [17,18], multidimensional scaling (MDS), locally linear embedding (LLE) [19,20], and so on.
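As a concrete example of feature dimensionality reduction, PCA can be applied to an expression matrix in a few lines. The following is a minimal scikit-learn sketch; the sample and gene counts are illustrative placeholders, not values from the dataset used later:

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative expression matrix: rows are samples, columns are genes.
X = np.random.rand(100, 5000)

# Map the 5000-dimensional features to a 64-dimensional space;
# unlike feature selection, the feature values change during the mapping.
pca = PCA(n_components=64)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # (100, 64)
```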
Recently, deep learning methods have been widely used in many fields. For example, ResNet [21] and SENet [22] have achieved excellent performance in image classification. However, there has been relatively little research on selecting instructive features from gene expression data with deep learning. In this paper, we propose a deep learning model, namely GSEnet, to extract useful features from gene expression data. GSEnet is a hybrid of ResNet [21] and SENet [22]. Nine classifiers are applied to the features extracted by the proposed model to evaluate its effectiveness.
The remainder of this paper is organized as follows. In Section 2, we briefly introduce the dataset used to evaluate the proposed model. In Section 3, we give details of the proposed model. In Section 4, we perform experimental evaluations of the proposed model. Finally, we discuss and conclude this paper in Sections 5 and 6.
2. Dataset
In this paper, we take a publicly available real single-cell RNA-seq dataset as the evaluation dataset. The dataset comes from the NCBI data repository (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE99095), and its values are collected from bone marrow cells [23]. It consists of 979 samples, among which 391 are from healthy donors and 588 are from patients with bone marrow failure and cytogenetic abnormalities. The expression of 17,258 genes has been monitored for each sample.
3. Methods
In this section, we describe the details and framework of the proposed model (shown in Table 1 and Figure 1). The framework consists of three modules, namely the pre-conv module, the SE-Resnet module and the SE-conv module. Details of these modules are given in the following subsections.
3.1. pre-conv module
The pre-conv module consists of a convolutional layer and a pooling layer, whose construction and function are the same as those of the first module in ResNet. Specifically, it employs a large convolutional kernel of size 7, and down-sampling is performed directly at a stride of 2 to effectively reduce the feature dimension as well as strengthen the local feature correlation. A max pooling layer is additionally used to reduce the feature dimension further. This allows the module to effectively retain the salient features encoded by the convolutional layer.
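The paper does not state the implementation framework; assuming PyTorch with 1-D convolutions over the expression vector, a minimal sketch of the pre-conv module might read as follows (the channel widths are illustrative):

```python
import torch
import torch.nn as nn

class PreConv(nn.Module):
    """Kernel-7 convolution with stride 2, followed by max pooling."""
    def __init__(self, in_channels=1, out_channels=64):
        super().__init__()
        # A large kernel (7) with stride 2 halves the feature dimension
        # while strengthening the local feature correlation.
        self.conv = nn.Conv1d(in_channels, out_channels,
                              kernel_size=7, stride=2, padding=3)
        self.relu = nn.ReLU(inplace=True)
        # Max pooling reduces the feature dimension further and keeps
        # the salient responses encoded by the convolutional layer.
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):  # x: (batch, in_channels, n_genes)
        return self.pool(self.relu(self.conv(x)))
```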
3.2. SE-Resnet module
The SE-Resnet module, as shown in Figure 2, consists of a ResNet block [21] and an SE block [22]. The ResNet block helps the model reuse features extracted by the pre-conv module, while the SE block is used to extract key features. The fusion and reasonable stacking of the two blocks help the proposed model extract higher-level semantic information. The output of the SE-Resnet module is defined as

$$y = \delta\left(\mathrm{SE}\big(f(f(f(x,w_1),w_2),w_3),\,w_4\big) + f(x,w_5)\right),$$

where x is the input feature maps, f indicates the convolution operation, w1, w2, w3 and w5 are the parameters of each of the four convolution layers, δ refers to the ReLU function, and SE(∙) corresponds to the SE block, the parameters of which are denoted by w4.

Let x̂ = f(f(f(x,w1),w2),w3). The output of the SE block is defined as

$$\mathrm{SE}(\hat{x},w_4) = \sigma\big(g(z,w_4)\big)\odot\hat{x},$$

where σ is the sigmoid function, g(∙,w4) is the excitation mapping of the SE block [22], ⊙ denotes channel-wise multiplication, and z is the squeeze vector obtained by global average pooling over each channel:

$$z_c = \frac{1}{L}\sum_{i=1}^{L}\hat{x}_c(i).$$

Note that L is the length of each feature map from x̂.
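Assuming PyTorch and 1-D feature maps, the computation above can be sketched as follows; the kernel sizes and the SE reduction ratio are illustrative assumptions, not values from Table 1:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation over channels (parameters w4)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x_hat):        # x_hat: (batch, C, L)
        z = x_hat.mean(dim=2)        # squeeze: average over length L
        s = self.fc(z).unsqueeze(2)  # excitation: channel weights in (0, 1)
        return x_hat * s             # rescale each feature map

class SEResNetModule(nn.Module):
    """Three stacked convolutions (w1..w3), SE block (w4), shortcut (w5)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels, 3, padding=1),
        )
        self.se = SEBlock(channels)
        self.shortcut = nn.Conv1d(channels, channels, 1)  # f(x, w5)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # y = ReLU( SE(f(f(f(x,w1),w2),w3), w4) + f(x,w5) )
        return self.relu(self.se(self.body(x)) + self.shortcut(x))
```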
3.3. SE-conv module
The SE-conv module is similar to the SE-Resnet module, except that an additional pooling layer is appended at the end of the module, as shown in Figure 2. Its function is mainly to increase the feature levels and reduce the feature dimension. Let w6, w7, w8 and w10 be the parameters of each of the four convolution layers and x the input feature maps. The output of the SE-conv module is defined as

$$y = P\left(\delta\left(\mathrm{SE}\big(f(f(f(x,w_6),w_7),w_8),\,w_9\big) + f(x,w_{10})\right)\right),$$

where w9 denotes the parameters of the SE block, and P(∙) is global average pooling if the SE block is the last one, or average pooling with a factor of 2 for the other SE blocks.
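Under the same assumptions, and reusing the SEResNetModule class from the previous sketch, the SE-conv module differs only in the appended pooling P(∙):

```python
import torch.nn as nn

class SEConvModule(nn.Module):
    """SE-Resnet-style module with a pooling layer P appended at the end."""
    def __init__(self, channels, last=False):
        super().__init__()
        # Convolutions w6..w8, SE block (w9) and shortcut (w10).
        self.block = SEResNetModule(channels)
        # Global average pooling for the last SE block,
        # average pooling with factor 2 for the other SE blocks.
        self.pool = (nn.AdaptiveAvgPool1d(1) if last
                     else nn.AvgPool1d(kernel_size=2, stride=2))

    def forward(self, x):
        return self.pool(self.block(x))
```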
4. Experiments
4.1. Experimental details
Training details. During the training process, we connect a multilayer perceptron at the end of the proposed instructive-feature extraction model, as shown in Figure 1. The numbers of nodes in the two hidden layers are 256 and 64, respectively. The Adam optimizer is used, the learning rate is set to 10−6, and the loss function is cross entropy. The dataset is divided into training and validation sets at a 9-to-1 ratio. Early stopping is adopted to avoid overfitting.
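A minimal sketch of this training configuration, assuming PyTorch (the hidden-layer widths, optimizer, learning rate and loss follow the text; feat_dim and the two-class output layer are assumptions for illustration):

```python
import torch
import torch.nn as nn

# MLP head appended to the feature extractor during training (Figure 1).
# feat_dim is an assumption: it must equal the extractor's output width.
feat_dim = 256
head = nn.Sequential(
    nn.Linear(feat_dim, 256), nn.ReLU(inplace=True),  # hidden layer 1
    nn.Linear(256, 64), nn.ReLU(inplace=True),        # hidden layer 2
    nn.Linear(64, 2),                                 # healthy vs. patient
)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-6)  # learning rate 10^-6
criterion = nn.CrossEntropyLoss()                         # cross-entropy loss
```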
Test details. Ten-fold cross-validation is used to evaluate the performance of the selected classifiers, details of which are described in the next subsection.
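Ten-fold cross-validation of a classifier on the extracted features can be run with scikit-learn; a minimal sketch with placeholder data:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Placeholders: X_feat is the GSEnet feature matrix, y the sample labels.
X_feat = np.random.rand(979, 256)
y = np.random.randint(0, 2, size=979)

# Ten-fold cross-validation of one of the selected classifiers.
scores = cross_val_score(SVC(), X_feat, y, cv=10, scoring="f1")
print(scores.mean(), scores.std())
```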
Experimental Environment. The proposed model is implemented in PyCharm on a computer with an Intel(R) Core(TM) i7-8700U CPU @ 3.20GHz, an NVIDIA GeForce GTX 1050 Ti, and the Windows 10 operating system. Training the proposed model takes about 12 hours, and prediction takes less than 1 minute.
4.2. Experimental results
To verify the effectiveness of the features extracted by the proposed model, they are fed into the following classifiers: Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), Naive Bayes (NB), Logistic Regression (LR), k-Nearest Neighbor (KNN), AdaBoost (ADA), Gradient Boosted Decision Tree (GBDT) and Linear Discriminant Analysis (LDA). Implementations of all these classifiers can be found in the scikit-learn library for Python (https://scikit-learn.org/stable/). The evaluation metrics we adopt are true positive rate (TPR), false negative rate (FNR), false positive rate (FPR), true negative rate (TNR), precision (PRE), F1-score (F1), and accuracy (ACC). Specifically, these metrics are defined as TPR = TP/(TP+FN), FNR = FN/(TP+FN), FPR = FP/(FP+TN), TNR = TN/(TN+FP), PRE = TP/(TP+FP), F1 = 2×P×R/(P+R), and ACC = (TP+TN)/(TP+FP+TN+FN). TP (true positive) is the number of samples correctly identified as positive, while FP (false positive) is the number of samples incorrectly identified as positive. Similarly, TN (true negative) is the number of samples correctly identified as negative, whereas FN (false negative) is the number of samples incorrectly identified as negative. P and R are precision and recall. Experimental results are given in Figure 3 and Table 2. Note that the values in Figure 3 are the means of the metrics over ten-fold cross-validation; standard deviations are given in Table 2. It is obvious that the performances of the classifiers are improved. In particular, KNN, ADA, NB, RF, DT and LDA, which do not perform well on the original samples, achieve performances similar to those of SVM and GBDT. Thus, the proposed model is effective in improving the performance of the classifiers.
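For reference, the seven metrics above can be computed directly from the binary confusion matrix; a small scikit-learn-based sketch:

```python
from sklearn.metrics import confusion_matrix

def binary_metrics(y_true, y_pred):
    """Compute TPR, FNR, FPR, TNR, PRE, F1 and ACC from predictions."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tpr = tp / (tp + fn)                   # true positive rate (recall)
    fnr = fn / (tp + fn)                   # false negative rate
    fpr = fp / (fp + tn)                   # false positive rate
    tnr = tn / (tn + fp)                   # true negative rate
    pre = tp / (tp + fp)                   # precision
    f1 = 2 * pre * tpr / (pre + tpr)       # F1 = 2*P*R/(P+R)
    acc = (tp + tn) / (tp + fp + tn + fn)  # accuracy
    return tpr, fnr, fpr, tnr, pre, f1, acc
```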
5. Discussion
5.1. Effect of the SE-Resnet modules
In this subsection, we discuss the effect of the SE-Resnet modules on classifier performance in terms of the F1 score; the results are given in Table 3. From Table 1, we can see that the number of SE-Resnet modules is ten. We delete existing modules from, or add new ones to, the model, using S and A as indicators. S2 and S4 indicate removing 2 and 4 modules, respectively, from the SE-Resnet module shown in Table 1 with ID = 6, while S6 further removes 2 modules from the SE-Resnet module with ID = 4. Correspondingly, A2 and A4 indicate appending one or two additional SE-Resnet modules to the modules given in ID = 4 and ID = 6. It is obvious that the original network setting given in Table 1, namely GSEnet, performs the best. However, the effect of these structural changes on the performance is small.
5.2. Comparison with feature selection methods
In this subsection, we compare the proposed model with representative feature selection methods, i.e., the t-test (T), analysis of variance (Var), lasso feature selection (Lasso) and Logistic Regression feature selection (Log), in terms of F1, and show the results in Table 4. For the t-test, the features with a p-value less than 0.05 are retained, leaving 8174 feature values. For the others, we select the top K salient feature values, where K = 256, 512, 1024, 2048 and 4096. For the proposed model, we modify the number of output channels of the last convolutional layer accordingly. It is obvious that the proposed model yields the best performance with the DT, RF, NB, KNN, ADA and GBDT classifiers.
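The t-test and top-K baselines can be expressed with scipy and scikit-learn; a hedged sketch (the 0.05 threshold and the K values follow the text; the ANOVA scoring via f_classif and the placeholder data are assumptions):

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.feature_selection import SelectKBest, f_classif

# Placeholder expression matrix (samples x genes) and binary labels.
X = np.random.rand(979, 17258)
y = np.random.randint(0, 2, size=979)

# t-test baseline: keep features whose p-value is below 0.05.
_, p = ttest_ind(X[y == 0], X[y == 1], axis=0)
X_t = X[:, p < 0.05]

# Top-K baseline: keep the K most salient features,
# K in {256, 512, 1024, 2048, 4096}; ANOVA F-scores are used here.
X_k = SelectKBest(f_classif, k=1024).fit_transform(X, y)
```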
6. Conclusions
In this paper, a novel deep learning model, namely GSEnet, is proposed. It combines ResNet and SENet, and is constructed to improve the extraction of instructive features from gene expression data. The proposed model has been evaluated on the GSE99095 dataset with 9 representative classifiers. Experimental results show the advantages of the proposed model in improving the performance of different classifiers: compared with the t-test, analysis of variance, lasso and Logistic Regression feature selection methods, GSEnet yields the best performance with the DT, RF, NB, KNN, ADA and GBDT classifiers.
Acknowledgments
This work was supported by the Natural Science Foundation of Liaoning Province under grant 2021-MS-085. The authors would like to thank the handling editor and the anonymous reviewers for their constructive suggestions on improving the quality of this paper.
Conflict of Interest
No potential conflict of interest was reported by the authors.