Computational methods for recognition of cancer protein markers in saliva

Ying Sun; Wei Du; Lili Yang; Min Dai; Ziying Dou; Yuxiang Wang; Jining Liu; Gang Zheng

doi:10.3934/mbe.2020134

Mathematical Biosciences and Engineering

2020, Volume 17, Issue 3: 2453-2469. doi: 10.3934/mbe.2020134

Research article

1. Tianjin Key Laboratory of Intelligence Computing and Novel Software Technology, School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384, China
2. Information Technology Research Base of Civil Aviation Administration of China, Civil Aviation University of China, Tianjin 300300, China
3. Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
4. Department of Obstetrics, The First Hospital of Jilin University, Changchun 130012, China

Received: 28 February 2018 Accepted: 17 February 2019 Published: 25 February 2020

In recent years, many studies have supported the view that cancer tissues can induce disease-specific changes in some salivary proteins through mediators involved in the pathogenesis of systemic diseases. These salivary proteins have the potential to serve as cancer-specific biomarkers for early diagnosis, and identifying them effectively remains a challenging problem. In this paper, we propose novel machine learning methods for recognizing cancer biomarkers in saliva in two stages. In the first stage, we recognize salivary secretory proteins, which are considered candidate cancer biomarkers. We collected 557 salivary secretory proteins out of 20379 human proteins from public databases and the published literature. We then present a training set construction strategy that addresses the class imbalance problem so that the classification methods achieve better accuracy. From the set of all human proteins, the proteins belonging to the same families as the salivary secretory proteins are removed. The remaining proteins are clustered with the SVC-KM method, and negative samples are selected from each cluster in proportion. Next, protein features are computed with existing tools: we collect 24 protein properties covering sequence, structure and physicochemical properties, for a total of 1087 features. A novel procedure based on local samples is proposed for selecting appropriate features, in order to further improve the performance of the SVM classifier. Experimental results show that, using the 32 selected features, the average sensitivity, specificity and accuracy of salivary secretory protein recognition on the training set are 97.09%, 98.10% and 97.61%, respectively. These methods improve recognition accuracy by addressing the unbalanced sample sizes and uneven sample distribution in the training set.
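As an illustration only, the proportional selection of negative samples could be sketched as follows. The sketch assumes that cluster labels are already available (the SVC-KM clustering itself is not implemented here) and reads "in proportion" as proportional to cluster size, which is our interpretation rather than a detail stated in the text. The function name `sample_negatives`, its parameters and the largest-remainder rounding are hypothetical choices made for this example.

```python
import random
from collections import defaultdict


def sample_negatives(cluster_of, n_negatives, seed=0):
    """Pick n_negatives protein IDs, giving each cluster a quota
    proportional to its size (hypothetical helper, not the authors' code).

    cluster_of: dict mapping protein ID -> cluster label (assumed given).
    """
    # Group protein IDs by their cluster label.
    clusters = defaultdict(list)
    for protein_id, cluster_id in cluster_of.items():
        clusters[cluster_id].append(protein_id)

    total = sum(len(members) for members in clusters.values())
    if not 0 <= n_negatives <= total:
        raise ValueError("n_negatives must be between 0 and the number of proteins")

    # Largest-remainder allocation: floor each proportional share, then hand
    # the leftover slots to the clusters with the largest fractional parts,
    # so that the quotas sum exactly to n_negatives.
    exact = {c: n_negatives * len(m) / total for c, m in clusters.items()}
    quota = {c: int(share) for c, share in exact.items()}
    leftover = n_negatives - sum(quota.values())
    by_remainder = sorted(exact, key=lambda c: exact[c] - quota[c], reverse=True)
    for c in by_remainder[:leftover]:
        quota[c] += 1

    # Draw each cluster's quota without replacement; sorting keeps the
    # result reproducible for a fixed seed.
    rng = random.Random(seed)
    chosen = []
    for c in sorted(clusters):
        chosen.extend(rng.sample(sorted(clusters[c]), quota[c]))
    return chosen
```

Sampling every cluster at the same rate keeps the negative set spread across the whole protein space instead of letting a few dense regions dominate, which is the uneven-distribution problem the training set construction is meant to address.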
In the second stage, we apply the best model to identify salivary secretory proteins among 58 reported cancer markers, obtaining a total of 42 proteins that are considered usable for salivary diagnosis. We also analyze gene expression data for three types of cancer and predict that the protein products of 33 genes will appear in saliva. This study provides an important computational tool that helps biologists and researchers reduce the number of candidate proteins and the cost of research, thereby accelerating the discovery of cancer biomarkers in saliva and promoting the development of salivary diagnosis.
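The sensitivity, specificity and accuracy quoted for the first stage are assumed here to follow the standard confusion-matrix definitions, with salivary secretory proteins as the positive class. A minimal sketch of how such values could be computed from true and predicted labels is given below. The function name `confusion_metrics` and the 1/0 label encoding are hypothetical and not taken from the paper.

```python
def confusion_metrics(y_true, y_pred):
    """Return (sensitivity, specificity, accuracy) for binary labels,
    where 1 marks a salivary secretory protein and 0 a negative sample.
    Hypothetical helper; assumes both classes occur in y_true."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives

    sensitivity = tp / (tp + fn)  # share of positives that are recovered
    specificity = tn / (tn + fp)  # share of negatives that are rejected
    accuracy = (tp + tn) / len(pairs)
    return sensitivity, specificity, accuracy
```

Reporting sensitivity and specificity alongside accuracy matters for this task because accuracy alone can look high on an unbalanced training set even when most positives are missed.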
Keywords: salivary secretory protein; cancer biomarker; feature selection based on local samples; SVC-KM; computational methods
Citation: Ying Sun, Wei Du, Lili Yang, Min Dai, Ziying Dou, Yuxiang Wang, Jining Liu, Gang Zheng. Computational methods for recognition of cancer protein markers in saliva[J]. Mathematical Biosciences and Engineering, 2020, 17(3): 2453-2469. doi: 10.3934/mbe.2020134

© 2020 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)