
We evaluated the effects of solvents with different polarities, namely methylene chloride (MC), methanol (MT), and hexane (HE), on the extraction of compounds from Mexican red pitaya seed oil. The fatty acid composition and the structural, rheological, and thermal properties of the extracts were characterized. MC gave the highest extraction yield (26.96%) and the greatest amount of mono- and polyunsaturated fatty acids, while MT gave the lowest yield (16.86%). Antioxidant activity was greater in the MT treatment owing to the extraction of highly polar compounds. The extracts contained unsaturated fatty acids, mostly oleic and linoleic acids, and saturated fatty acids such as palmitic acid. The lowest solidification temperature was −6.35 ℃ for MC, attributable to its fatty acid composition, and the degradation temperature was around 240 ℃. Viscosity, a quality parameter, was highest for the MC treatment and differed significantly from HE and MT. FT-IR spectroscopy showed the characteristic absorption bands of triglycerides, with prominent bands at 2852 cm−1 and 2924 cm−1, indicating that the samples were rich in unsaturated and polyunsaturated fatty acids. These results suggest that pitaya seed oil is an excellent alternative source of essential fatty acids with potential physiological benefits.
Citation: David Neder-Suárez, Daniel Lardizabal-Gutierrez, Nubia Amaya-Olivas, León Raúl Hernández-Ochoa, Jesus Alberto Vázquez-Rodríguez, Miguel Á. Sanchez-Madrigal, Ivan Salmerón-Ochoa, Armando Quintero-Ramos. Effects of the extraction of fatty acids and thermal/rheological properties of Mexican red pitaya oil[J]. AIMS Agriculture and Food, 2024, 9(1): 304-316. doi: 10.3934/agrfood.2024018
Protein post-translational modification is an important chemical process that plays a key role in regulating cell functions [1] and changes the physical and chemical properties of proteins. More than 400 post-translational modifications, including methylation [2], acetylation [3], phosphorylation [4], and S-nitrosylation (SNO) [5], have been discovered so far. SNO is a reversible post-translational modification, and a large number of studies have shown that it plays an important role in multiple biological processes such as redox signal transduction [6], cell signal transduction [7], cell senescence [8], and transcription [9]. SNO is also related to many human diseases such as cancer [10], Alzheimer's disease [11], and chronic renal failure [12]. Therefore, a well-grounded understanding of SNO is of great significance for the study of basic biological processes [9,13] and the development of drugs [14]. In recent years, many SNO sites have been identified through molecular signals [15,16], but the identification of SNO sites still suffers from low accuracy and is time-consuming and labor-intensive. With the continuous development of computer technology, a large number of computational models have been used to predict the specific sites of SNO modification.
Many post-translational modifications of proteins have been detected by a variety of computational models. Qiu identified phosphorylated [17] and acetylated [18] proteins with GO annotations. GPS-SNO [19], SNOSite [20], iSNO-PseAAC [21], PreSNO [5] and RecSNO [22] have been applied to the prediction of SNO sites. The GPS-SNO, SNOSite and iSNO-PseAAC models use relatively small data sets, and many negative samples in these data sets have since been experimentally verified as positive samples. The data sets used by PreSNO and RecSNO are relatively large and recent, but there is still room for improvement in the performance of these models.
On the basis of previous research, this work established two models, one for predicting SNO proteins and one for predicting SNO sites. For predicting SNO proteins, features are built from a KNN scoring matrix derived from the proteins' GO annotation information [18], a bag-of-words model, and the PseAAC [23,24] of the amino acid sequence. Fusing multiple features reflects the information in the protein sequence more comprehensively and improves the prediction results. Because the data set is imbalanced, a combination of an oversampling technique and random deletion is applied to balance the training set. For predicting SNO sites, two feature extraction methods, TPC [25] and CKSAAP [26], are used to extract the features of protein sequence fragments. To eliminate redundancy and noise in the original feature space, the elastic net [27] is used to reduce the dimensionality of the feature space after the original features are fused. Random Forest served as the classifier and was verified with 5-fold cross-validation. The overall workflow is shown in Figure 1.
To obtain a scientifically sound prediction result, a strict benchmark data set is essential. UniProtKB has been accepted by most bioinformatics researchers. Here, the negative samples are extracted from UniProtKB and the positive samples are extracted from Xie's [28], which is a high-quality data set based on extensive literature research. The protein sequence can be expressed as:
$$P = R_1 R_2 R_3 \cdots R_i \cdots R_L \tag{1}$$
where $R_i$ represents the $i$-th amino acid residue, and $L$ represents the length of the protein sequence.
To identify SNO proteins, we constructed a benchmark data set similar to that of Hasan et al. [5], which consists of 3113 SNO proteins. Every positive sample, i.e., SNO protein, has at least one SNO site. For negative samples, we randomly selected 18,047 proteins without any SNO site from UniProtKB. To make the results more rigorous, CD-HIT was used with a 30% sequence-identity cutoff to remove redundant sequences from the 3113 positive samples and 18,047 negative samples. Finally, 2192 positive samples and 7809 negative samples were retained in the proposed benchmark data set.
The benchmark data set for predicting SNO sites is the same as that of Hasan et al. [5], which consists of 3383 positive samples and 3365 negative samples. A potential SNO(C) site-containing peptide sample can be generally expressed by
$$\bar{P}_{\xi} = R_{-\xi} R_{-(\xi-1)} \cdots R_{-2} R_{-1}\, C\, R_{+1} R_{+2} \cdots R_{+(\xi-1)} R_{+\xi} \tag{2}$$
where the subscript $\xi$ is an integer, $R_{-\xi}$ represents the $\xi$-th upstream amino acid residue from the center, $R_{+\xi}$ the $\xi$-th downstream amino acid residue, and so forth. If the number of residues to the left or right of the center C is less than $\xi$, the pseudo amino acid "X" is used to pad the sequence. The $(2\xi+1)$-tuple peptide sample $\bar{P}_{\xi}$ can be further classified into the following two categories:
$$\bar{P}_{\xi} \in \begin{cases} \bar{P}^{+}_{\xi}, & \text{if its center is an SNO site} \\ \bar{P}^{-}_{\xi}, & \text{otherwise} \end{cases} \tag{3}$$
where $\bar{P}^{+}_{\xi}$ denotes a true SNO segment with C at its center, $\bar{P}^{-}_{\xi}$ a corresponding false SNO segment, and the symbol $\in$ means "a member of" in set theory.
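The window construction of Eq (2) can be sketched as follows. This is a minimal illustration, not the authors' code; the function name `extract_windows` and its parameters are hypothetical.

```python
def extract_windows(sequence, xi, center="C", pad="X"):
    """Return (position, peptide) pairs for every `center` residue.

    Each peptide has length 2*xi + 1. Missing flanking residues are
    filled with the pseudo amino acid `pad`, as described for Eq (2).
    """
    padded = pad * xi + sequence + pad * xi
    windows = []
    for pos, residue in enumerate(sequence):
        if residue == center:
            # Index `pos` in `sequence` is index `pos + xi` in `padded`,
            # so the window starts at `pos` and spans 2*xi + 1 residues.
            windows.append((pos, padded[pos:pos + 2 * xi + 1]))
    return windows


# Toy sequence with cysteines near both ends, so padding is needed.
for position, peptide in extract_windows("MCAAC", xi=2):
    print(position, peptide)
```

Padding the whole sequence once, rather than each window separately, keeps the slicing uniform for cysteines close to either terminus.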
GO-KNN [18] features were extracted on the basis of the GO annotations of proteins. We first retrieve the GO terms of all protein sequences and then calculate the distance between proteins. Taking protein $P_1$ and any other protein $P_2$ as an example, their GO terms can be listed as $P_1^{GO} = \{GO^1_1, GO^1_2, \cdots, GO^1_M\}$ and $P_2^{GO} = \{GO^2_1, GO^2_2, \cdots, GO^2_N\}$. If a protein has no GO term, the GO terms of its homologous protein are used instead. The distance between two proteins can be calculated with Eq (4):
$$\mathrm{Distance}(P_1, P_2) = 1 - \frac{\lfloor P_1^{GO} \cap P_2^{GO} \rfloor}{\lfloor P_1^{GO} \cup P_2^{GO} \rfloor} \tag{4}$$
where $GO^1_i$ and $GO^2_i$ represent the $i$-th GO term of $P_1$ and $P_2$, respectively, $M$ and $N$ are the numbers of GO terms of $P_1$ and $P_2$, respectively, $\cup$ and $\cap$ are the union and intersection in set theory, and $\lfloor\ \rfloor$ represents the number of elements in a set. The GO-KNN features are then extracted according to the following steps: 1) Sorting the calculated distances in ascending order; 2) Selecting the first k nearest neighbors of the test protein; 3) Calculating the percentage of positive samples among the k neighbors. In this study, the values of k were 2, 4, 8, 16, 62, 64, 128, 256, 512 and 1024. In this way, a 10-dimensional feature vector $(x_1, x_2, \cdots, x_{10})$ represents the protein $P_1$.
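The distance of Eq (4) and the three extraction steps can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, GO terms are represented as plain string sets, and returning a distance of 1 for two empty annotation sets is an assumption, since the text instead substitutes homolog annotations in that case.

```python
def go_distance(go_a, go_b):
    """Eq (4): one minus the ratio of shared GO terms to all GO terms."""
    a, b = set(go_a), set(go_b)
    union = a | b
    if not union:
        return 1.0  # assumption: treat two unannotated proteins as maximally distant
    return 1.0 - len(a & b) / len(union)


def go_knn_features(query_go, train_go, train_labels, k_values):
    """Fraction of positive (label 1) proteins among the k nearest neighbours."""
    # Step 1: sort training proteins by ascending distance to the query.
    ranked = sorted(
        ((go_distance(query_go, go), label)
         for go, label in zip(train_go, train_labels)),
        key=lambda pair: pair[0],
    )
    features = []
    for k in k_values:
        neighbours = ranked[:k]                            # step 2
        positives = sum(label for _, label in neighbours)
        features.append(positives / len(neighbours))       # step 3
    return features


train_go = [{"GO:1", "GO:2"}, {"GO:2", "GO:3"}, {"GO:9"}]
train_labels = [1, 0, 1]
print(go_knn_features({"GO:1", "GO:2"}, train_go, train_labels, k_values=[1, 2, 3]))
```

With one feature per value of k, the ten values listed in the text would yield the 10-dimensional vector.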
A bag-of-words model [29] based on the physical and chemical properties of proteins has been used to identify GPCR-drug interactions. The main steps are as follows: 1) Encoding the protein sequence with its physical and chemical properties. Various physical and chemical properties of the 20 common amino acids are available [30]. After careful experimental comparison, hydrophilicity was selected as the indicator for the proposed model. 2) Designing wordbooks for the protein. With window sizes of 1, 2 and 3 and a step size of 1, the coded sequence is divided into segments of different lengths. Segments of length 1 form wordbook WB1, segments of length 2 form wordbook WB2, and segments of length 3 form wordbook WB3. A fourth wordbook uses a window of size 2 and a step size of 1, but here the two positions of the window are separated by one amino acid. The coded sequence is again divided into fragments of length 2, and these fragments form wordbook WB4. 3) Clustering the wordbooks. The words in wordbook WB1 were divided into 20 sub-groups according to the types of amino acids. The words in WB2, WB3 and WB4 were clustered with the K-means algorithm into 16, 62 and 16 clusters, respectively. 4) Calculating the ratio of the number of words in each cluster to the total number of words in the wordbook with Eq (5).
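$$X_i^{WB_j} = \frac{x_i^{WB_j}}{N}, \quad i = 1, \ldots, K; \; j = 1, 2, 3, 4 \tag{5}$$
Here, $K$ is the number of clusters in wordbook $WB_j$, $x_i^{WB_j}$ is the number of words in the $i$-th cluster of wordbook $WB_j$, $N$ is the total number of words in wordbook $WB_j$, and $X_i^{WB_j}$ is the resulting frequency. A 114-D feature vector (20 + 16 + 62 + 16) is then formed for a given protein sequence, i.e., $(X_1^{WB_1}, \ldots, X_{20}^{WB_1}, X_1^{WB_2}, \ldots, X_{16}^{WB_2}, X_1^{WB_3}, \ldots, X_{62}^{WB_3}, X_1^{WB_4}, \ldots, X_{16}^{WB_4})$.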
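The wordbook construction and the frequency calculation of Eq (5) can be sketched as below. This is only a sketch under assumptions: the encoded values are toy numbers rather than a real hydrophilicity scale, the function names are hypothetical, and the K-means clustering step is replaced by a caller-supplied `assign` function that maps a word to a cluster index.

```python
def wordbooks(encoded):
    """Split an encoded sequence into wordbooks WB1..WB4 (step size 1)."""
    n = len(encoded)
    wb1 = [(encoded[i],) for i in range(n)]                # window size 1
    wb2 = [tuple(encoded[i:i + 2]) for i in range(n - 1)]  # window size 2
    wb3 = [tuple(encoded[i:i + 3]) for i in range(n - 2)]  # window size 3
    # WB4: two positions separated by one residue (gapped window of size 2).
    wb4 = [(encoded[i], encoded[i + 2]) for i in range(n - 2)]
    return wb1, wb2, wb3, wb4


def cluster_frequencies(words, assign, n_clusters):
    """Eq (5): share of the wordbook's words that fall in each cluster."""
    counts = [0] * n_clusters
    for word in words:
        counts[assign(word)] += 1
    total = len(words)
    return [c / total for c in counts] if total else [0.0] * n_clusters


encoded = [1.0, 2.0, 3.0, 4.0]  # toy values standing in for hydrophilicity
wb1, wb2, wb3, wb4 = wordbooks(encoded)
# Toy cluster assignment standing in for a fitted K-means model.
print(cluster_frequencies(wb2, lambda w: 0 if sum(w) < 5 else 1, n_clusters=2))
```

In a full pipeline, `assign` would return the index of the nearest K-means centroid learned from the training wordbooks.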
PseAAC [23,24] is a widely used feature in bioinformatics. In this work, six physical and chemical properties are used: hydrophobicity, hydrophilicity, side-chain mass, pK1, pK2 and pI. We first used Eq (6) to standardize the original physical and chemical properties of the amino acids:
$$W_a(i) = \frac{W_a^0(i) - \sum_{i=1}^{20} \frac{W_a^0(i)}{20}}{\sqrt{\dfrac{\sum_{i=1}^{20} \left[ W_a^0(i) - \sum_{i=1}^{20} \frac{W_a^0(i)}{20} \right]^2}{20}}} \tag{6}$$
where $a \in \{1, 2, \cdots, 6\}$ and $i \in \{1, 2, \cdots, 20\}$. $W_a^0(i)$ represents the original value of the $a$-th physical and chemical property of the $i$-th amino acid. The standardized values are then substituted into Eq (7):
$$\Theta(R_i, R_j) = \frac{1}{6} \sum_{a=1}^{6} \left[ W_a(R_j) - W_a(R_i) \right]^2 \tag{7}$$
where $W_1(R_j)$ represents the hydrophobicity value of $R_j$ and, by analogy, $W_6(R_i)$ represents the pI value of $R_i$. The correlation factor of each layer can then be obtained with Eq (8):
$$\theta_{\lambda} = \frac{1}{L - \lambda} \sum_{i=1}^{L-\lambda} \Theta(R_i, R_{i+\lambda}), \quad \lambda < L \tag{8}$$
where θλ represents the correlation factor of the λ-th layer of the protein sequence. Finally, the protein sequence is converted into a feature by Eq (9):
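$$x_i = \begin{cases} \dfrac{f_i}{\sum_{i=1}^{20} f_i + \omega \sum_{j=1}^{\lambda} \theta_j}, & 1 \le i \le 20 \\[2ex] \dfrac{\omega\, \theta_{i-20}}{\sum_{i=1}^{20} f_i + \omega \sum_{j=1}^{\lambda} \theta_j}, & 20 + 1 \le i \le 20 + \lambda \end{cases} \tag{9}$$
Here, $f_i$ represents the frequency of the $i$-th amino acid, $\omega$ is 0.5, and $\lambda$ is 5. In this way, a 25-dimensional feature vector is formed.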
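Eqs (6) to (9) translate fairly directly into code. The sketch below is illustrative rather than the authors' implementation: the function names are hypothetical, the single toy property in the usage example is made up (real property tables are not reproduced here), and the average in Eq (7) is taken over however many properties are supplied, which equals 6 in the paper.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def standardize(raw):
    """Eq (6): zero-mean, unit-variance scaling of one property over the 20 amino acids."""
    values = list(raw.values())
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return {aa: (v - mean) / std for aa, v in raw.items()}


def pseaac(sequence, properties, lam=5, omega=0.5):
    """Return the (20 + lam)-dimensional PseAAC vector of Eq (9)."""
    length = len(sequence)
    if lam >= length:
        raise ValueError("lambda must be smaller than the sequence length")

    def big_theta(r_i, r_j):  # Eq (7)
        return sum((p[r_j] - p[r_i]) ** 2 for p in properties) / len(properties)

    # Eq (8): one correlation factor per layer 1..lam.
    thetas = [
        sum(big_theta(sequence[i], sequence[i + k]) for i in range(length - k)) / (length - k)
        for k in range(1, lam + 1)
    ]
    freqs = [sequence.count(aa) / length for aa in AMINO_ACIDS]
    denom = sum(freqs) + omega * sum(thetas)
    return [f / denom for f in freqs] + [omega * t / denom for t in thetas]


# Hypothetical property values, used only to exercise the code.
toy_property = standardize({aa: float(i) for i, aa in enumerate(AMINO_ACIDS)})
vector = pseaac("ACDEFGHIKLMNPQRSTVWYAC", [toy_property])
print(len(vector))
```

Because every component shares the same denominator, the components of the resulting vector sum to 1.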
To reduce the adverse effect of unbalanced data on the performance of the model, many methods for dealing with unbalanced data have been proposed, such as the Synthetic Minority Oversampling Technique [31] (SMOTE) and the Random Under Sampler [32] (RUS). SMOTE, proposed by Chawla et al., has been used to predict protein sites [27] and to improve the prognostic assessment of lung cancer [33]. RUS is a simple and popular under-sampling method that has been used in pediatric pneumonia detection [34] and in improving convolutional neural network performance [35]. In this study, we combined the two methods to process the data: SMOTE is used to oversample the positive samples, and RUS is used to under-sample the negative samples. In the end, the number of processed positive samples equals the number of negative samples. The specific process is shown in Figure 2.
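A minimal, dependency-free sketch of the combined balancing step is given below. It follows the usual SMOTE idea of interpolating between a minority sample and one of its nearest minority neighbours. The function names, the neighbour count `k=5`, and the common `target` size are assumptions, since the text does not state the final class size.

```python
import random


def smote(minority, n_new, k=5, rng=None):
    """Create `n_new` synthetic points by interpolating between minority neighbours."""
    rng = random.Random(0) if rng is None else rng
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of `base`, excluding `base` itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append([a + gap * (b - a) for a, b in zip(base, mate)])
    return synthetic


def random_under_sample(majority, n_keep, rng=None):
    """Keep a random subset of `n_keep` majority samples."""
    rng = random.Random(0) if rng is None else rng
    return rng.sample(majority, n_keep)


def sr_balance(positives, negatives, target, seed=0):
    """Oversample positives and under-sample negatives to `target` samples each."""
    rng = random.Random(seed)
    pos = positives + smote(positives, target - len(positives), rng=rng)
    neg = random_under_sample(negatives, target, rng=rng)
    return pos, neg


positives = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
negatives = [[float(i), 5.0] for i in range(10)]
pos, neg = sr_balance(positives, negatives, target=6)
print(len(pos), len(neg))
```

The sketch assumes `len(positives) <= target <= len(negatives)` and at least two positive samples. Balancing would normally be applied to the training folds only, so that synthetic points do not leak into the evaluation data.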
CKSAAP [25] has been widely used in protein site prediction [26] since it can effectively capture the internal patterns of a given protein sequence. The protein fragment is composed of the 20 common amino acids and a pseudo amino acid, which gives 441 residue pairs (AA, AC, ..., XX) for each $l$. Here $l$ represents the number of residues between the two members of a pair. The following formula is used to calculate the features of the fragment:
$$\left( \frac{N_{AA}}{N_T}, \frac{N_{AC}}{N_T}, \frac{N_{AD}}{N_T}, \cdots, \frac{N_{XX}}{N_T} \right)_{441} \tag{10}$$
Here, $N_{AA}, N_{AC}, \cdots$ represent the number of times the corresponding amino acid pair appears in the fragment, $L$ is the length of the protein fragment, and $N_T = L - l - 1$. In this study, the values of $l$ are 0, 1, 2, 3 and 4, and the corresponding $N_T$ are 40, 39, 38, 37 and 36, respectively. A 2205-D feature vector (441 × 5) is thus formed.
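Eq (10) can be sketched as follows. This is an illustration with hypothetical names, and the ordering of the 441 pairs (alphabetical, with X last) is an assumption.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"  # 20 amino acids plus the pseudo residue X


def cksaap(fragment, gaps=(0, 1, 2, 3, 4)):
    """Composition of k-spaced amino acid pairs, 441 features per gap value."""
    pairs = ["".join(p) for p in product(ALPHABET, repeat=2)]  # 21 * 21 = 441
    features = []
    for gap in gaps:
        n_total = len(fragment) - gap - 1  # N_T = L - l - 1
        counts = dict.fromkeys(pairs, 0)
        for i in range(n_total):
            # Pair formed by residue i and the residue `gap` positions further on.
            counts[fragment[i] + fragment[i + gap + 1]] += 1
        features.extend(counts[p] / n_total for p in pairs)
    return features


fragment = "A" * 20 + "C" + "A" * 20  # toy 41-residue fragment
print(len(cksaap(fragment)))
```

A 41-residue fragment gives $N_T$ = 40, 39, 38, 37 and 36 for the five gap values, matching the numbers quoted above.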
Based on the structural properties of proteins, researchers have proposed the tripeptide composition (TPC). It has been used to predict protein subcellular localization [36] and to identify Plasmodium mitochondrial proteins [37]. TPC calculates the frequency of three consecutive amino acids, so that a protein fragment can be represented by a 9261-dimensional vector ($21^3$ tripeptides over the 20 amino acids and the pseudo amino acid):
$$P_i = \frac{N_i}{\sum_{i=1}^{9261} N_i} \tag{11}$$
where $N_i$ represents the number of occurrences of the $i$-th of the 9261 tripeptides.
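A short sketch of Eq (11), again with hypothetical names and an assumed alphabetical ordering of the tripeptides:

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"  # 20 amino acids plus the pseudo residue X


def tpc(fragment):
    """Tripeptide composition: relative frequency of each of the 21**3 tripeptides."""
    tripeptides = ["".join(t) for t in product(ALPHABET, repeat=3)]  # 9261 entries
    counts = dict.fromkeys(tripeptides, 0)
    for i in range(len(fragment) - 2):
        counts[fragment[i:i + 3]] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(tripeptides)  # fragment shorter than three residues
    return [counts[t] / total for t in tripeptides]


features = tpc("AAAC")
print(len(features), features[0], features[1])
```

For a fragment only a few dozen residues long, nearly all of the 9261 entries are zero, which is one reason a feature selection step is useful afterwards.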
The elastic net proposed by Zou and Hastie [38] is an effective feature selection method. By introducing both the L1 and L2 norms as penalties in a linear regression model, the elastic net performs continuous shrinkage and automatic variable selection at the same time, and it can also select groups of correlated variables. Elastic nets have been widely used in protein site prediction [27,39] and have achieved good results.
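For orientation, the (naive) elastic net criterion is usually written as below. The symbols $y$ (label vector), $X$ (feature matrix), $\beta$ (coefficients) and the penalty weights $\lambda_1, \lambda_2$ are standard notation assumed here, not taken from this paper, and the penalty values used in this study are not stated.

```latex
\hat{\beta} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2
            + \lambda_1 \lVert \beta \rVert_1
            + \lambda_2 \lVert \beta \rVert_2^2
```

The L1 term drives many coefficients exactly to zero, and a common way to use the model for dimensionality reduction is to keep only the features whose fitted coefficients are nonzero. Whether this exact selection rule was applied here is an assumption.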
In this study, four indicators were used to evaluate the performance of the models: accuracy (ACC), sensitivity (Sn), specificity (Sp) and the Matthews correlation coefficient (MCC) [40], which are defined by Eq (12):
$$\begin{cases} Sn = \dfrac{TP}{TP + FN} \\[2ex] Sp = \dfrac{TN}{TN + FP} \\[2ex] ACC = \dfrac{TP + TN}{TP + FP + TN + FN} \\[2ex] MCC = \dfrac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) \times (TP + FN) \times (TN + FP) \times (TN + FN)}} \end{cases} \tag{12}$$
In predicting SNO proteins, TP is the number of proteins that are predicted to have SNO sites and actually have SNO sites, and TN is the number of proteins that are predicted to have no SNO sites and actually have none. FP is the number of proteins without SNO sites that are predicted to have SNO sites, and FN is the number of proteins with SNO sites that are predicted to have none. In addition, the area under the ROC curve (AUC) is also used to evaluate this model.
In predicting SNO sites, TP indicates the number of actual SNO sites predicted to be SNO sites, and TN indicates the number of non-SNO sites predicted to be not SNO sites. FP is the number of non-SNO sites predicted to be SNO sites, and FN is the number of actual SNO sites predicted to be non-SNO sites.
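The four indicators of Eq (12) can be computed directly from the confusion-matrix counts. The sketch below uses a hypothetical function name and made-up counts. Returning an MCC of 0 when the denominator vanishes is a common convention and an assumption here.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Sn, Sp, ACC and MCC from confusion-matrix counts, following Eq (12)."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # assumed convention
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc}


# Made-up counts, for illustration only.
print(classification_metrics(tp=40, tn=45, fp=5, fn=10))
```

Unlike ACC, MCC uses all four counts symmetrically, which makes it more informative on imbalanced data sets such as the SNO protein benchmark.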
Random Forest [41] is an algorithm that combines multiple trees through ensemble learning; its basic unit is the decision tree. As a highly flexible machine learning algorithm, Random Forest (RF) has been widely used in data analysis [42], bioinformatics [43] and technological development [44].
Naive Bayes (NB) [45] is a simple and effective classifier, which is widely used in software defect prediction [46], medical diagnosis [47] and bioinformatics [48]. NB is based on Bayes' theorem and the assumption that features are conditionally independent, which greatly reduces the complexity of the classification algorithm.
K Nearest Neighbor (KNN) [49] is a supervised machine learning algorithm that is widely used in face recognition [50], disease research [51] and engineering applications [52]. Its main idea is to assign a query sample to a category based on the categories of the k training points closest to it.
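The KNN idea can be illustrated in a few lines. This is a generic sketch with Euclidean distance and majority voting, not the configuration used in this study. The function name and the toy data are hypothetical.

```python
from collections import Counter


def knn_predict(query, points, labels, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Rank training points by squared Euclidean distance to the query.
    order = sorted(
        range(len(points)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(query, points[i])),
    )
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
labels = [0, 0, 1, 1]
print(knn_predict([0.2, 0.1], points, labels, k=3))
print(knn_predict([5.5, 5.0], points, labels, k=3))
```

An odd k avoids tied votes in a two-class problem such as SNO versus non-SNO.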
XGBoost [53,54] is an improved boosting algorithm based on GBDT [55]. It is an ensemble boosting algorithm that combines many base models to form a strong model. Because of its good predictive performance and high training efficiency, XGBoost has been widely used in the field of data analysis.
In this research, three feature extraction methods, GO-KNN, BOW and PseAAC, were used to encode the protein sequence, yielding 10-D, 114-D and 25-D feature vectors, respectively. These three kinds of features were fused into a 149-D feature vector, denoted ALL. The prediction results obtained with the different features under 5-fold cross-validation are shown in Table 1.
Feature | Acc (%) | Sn (%) | Sp (%) | MCC | AUC |
GO-KNN | 82.72 | 48.36 | 92.37 | 0.4533 | 0.8521 |
BOW | 78.83 | 13.04 | 97.30 | 0.1969 | 0.7359 |
PseAAC | 79.38 | 19.43 | 96.22 | 0.2503 | 0.7616 |
ALL | 83.77 | 49.49 | 93.40 | 0.4840 | 0.8593 |
Table 1 shows that different features give different prediction results. Among the three methods, GO-KNN has the highest ACC, Sn, MCC and AUC, which are 82.72%, 48.36%, 0.4533 and 0.8521, respectively. BOW has the lowest ACC, Sn, MCC and AUC, which are 78.83%, 13.04%, 0.1969 and 0.7359, respectively, but its Sp of 97.30% is the highest. After combining the three features, ACC, Sn, Sp, MCC and AUC are 83.77%, 49.49%, 93.40%, 0.4840 and 0.8593, respectively. Of these, ACC, Sn, MCC and AUC are all higher than those produced by GO-KNN alone. The results show that multi-feature fusion can improve several indicators. To better analyze the influence of different features on the prediction of SNO proteins, the prediction results obtained from the three features and their fusion are shown in Figure 3.
Figure 3 shows that the three features and their fusion affect the five evaluation indicators to some extent. All of them score lower on Sn and MCC and higher on ACC, Sp and AUC. Comparing the four feature encodings, the ACC, Sn, MCC and AUC of the fusion feature ALL are improved. Multi-feature fusion reflects sequence information more comprehensively and thereby improves prediction ability. Therefore, multi-feature fusion is used to predict SNO proteins.
Here, the combination of SMOTE and RUS is denoted as the SR balancer. We input the data sets before and after balancing into the model and used 5-fold cross-validation to obtain ACC, Sn, Sp, MCC and AUC on the unbalanced and balanced data sets, as shown in Table 2.
Data set | Acc (%) | Sn (%) | Sp (%) | MCC | AUC |
Imbalance | 83.77 | 49.49 | 93.40 | 0.4840 | 0.8593 |
Balance | 81.84 | 70.82 | 84.93 | 0.5178 | 0.8635 |
Table 2 shows that after balancing, Sn and Sp are much closer to each other: Sn rises from 49.49% to 70.82%, while Sp falls from 93.40% to 84.93%. ACC decreases slightly, but MCC and AUC both improve. Balancing the data set is therefore necessary.
Classifiers play an important role in model prediction. This work used the above four classifiers to identify SNO proteins. The ACC, Sn, Sp, MCC and AUC of each classifier under 5-fold cross-validation are shown in Table 3. Random Forest achieves the best ACC, Sp, MCC and AUC, although KNN and NB have higher Sn. To better compare the different classifiers, the prediction results of the four classifiers are shown in Figure 4.
Algorithms | Acc (%) | Sn (%) | Sp (%) | MCC | AUC |
RF | 81.84 | 70.82 | 84.93 | 0.5178 | 0.8635 |
NB | 63.81 | 78.37 | 59.73 | 0.3154 | 0.7710 |
KNN | 71.97 | 83.44 | 68.75 | 0.4366 | 0.8360 |
XGBoost | 80.73 | 70.07 | 83.72 | 0.4953 | 0.8553 |
The area under the ROC curve summarizes the predictive performance of a model across all decision thresholds. Figure 4 shows that the area under the ROC curve is largest when Random Forest is used as the classifier. Therefore, Random Forest is the best choice for the proposed model.
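The AUC has a simple rank interpretation: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. A minimal sketch under that definition, with a hypothetical function name and made-up scores, is:

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Each correctly ordered pair counts 1, each tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up classifier scores, for illustration only.
print(roc_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))
```

This pairwise form is quadratic in the number of samples. It is fine for illustration, but library implementations based on sorting are preferable for large data sets.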
In this study, two kinds of features, CKSAAP and TPC, were used, giving 2205-dimensional and 9261-dimensional feature vectors, respectively, on the basis of the above-mentioned algorithms. To better reflect the information in protein fragments, these features are fused into an 11,466-dimensional feature vector. The prediction results obtained with the different features under 5-fold cross-validation are shown in Table 4.
Feature | Acc (%) | Sn (%) | Sp (%) | MCC | AUC |
CKSAAP | 73.97 | 83.67 | 64.27 | 0.4891 | 0.8036 |
TPC | 71.38 | 66.07 | 76.74 | 0.4305 | 0.8069 |
ALL | 75.36 | 86.39 | 64.31 | 0.5201 | 0.8196 |
Table 4 shows that the ACC, Sn and MCC of CKSAAP are higher than those of TPC, while TPC performs better than CKSAAP on Sp and AUC. After feature fusion, ACC, Sn, MCC and AUC are all higher than with either single feature. Therefore, feature fusion is beneficial for this task.
Multi-information fusion extracts protein sequence information more comprehensively, but it also introduces redundancy and noise. Dimensionality reduction can retain the important features while improving the computational efficiency of the model. In this paper, the elastic net was used to reduce the dimensionality of the fused feature set, yielding a subset of 704 features. The prediction results of Random Forest under 5-fold cross-validation are shown in Table 5.
Features | Acc (%) | Sn (%) | Sp (%) | MCC | AUC |
All | 75.36 | 86.39 | 64.31 | 0.5201 | 0.8196 |
Elastic net | 76.02 | 85.68 | 66.33 | 0.5304 | 0.8260 |
After dimensionality reduction with the elastic net, all evaluation indicators except Sn are improved. In addition, because the feature dimension is greatly reduced, the efficiency of the model is also significantly improved.
Four classifiers, Random Forest, Naive Bayes, K-Nearest Neighbor and XGBoost, were tested in this work for predicting SNO sites. The results under 5-fold cross-validation are shown in Table 6. Table 6 shows that Naive Bayes and K-Nearest Neighbor are relatively inferior. Except for Sp, Random Forest is the best on all indicators. To evaluate the performance of the classifiers more comprehensively, the ROC curves of the different classifiers are shown in Figure 5.
Algorithms | Acc (%) | Sn (%) | Sp (%) | MCC | AUC |
RF | 76.02 | 85.68 | 66.33 | 0.5304 | 0.8260 |
NB | 69.74 | 79.46 | 59.98 | 0.4022 | 0.7605 |
KNN | 63.63 | 46.39 | 81.00 | 0.2923 | 0.7246 |
XGBoost | 72.88 | 74.37 | 71.40 | 0.4580 | 0.8015 |
Figure 5 shows that the area under the ROC curve of Random Forest is the largest. Therefore, Random Forest was selected as the classifier of the proposed model.
To further evaluate the performance of this model, we compared it with the PreSNO and RecSNO models. The prediction results of the three methods on the same data set are shown in Table 7. The ACC, Sn and MCC of the proposed model are the highest of the three. Its Sp is lower than that of PreSNO and its AUC lies between those of RecSNO and PreSNO. Overall, the proposed model performs better than PreSNO and RecSNO on most indicators.
Method | Acc (%) | Sn (%) | Sp (%) | MCC | AUC |
PreSNO | 70 | 54 | 86 | 0.42 | 0.84 |
RecSNO | 72 | 79 | 66 | 0.45 | 0.79 |
RF-SNOPS | 76.02 | 85.68 | 66.33 | 0.5304 | 0.8260 |
To identify SNO proteins, we used GO-KNN, BOW and PseAAC to extract sequence information. GO-KNN extracts KNN neighbor information based on protein GO annotations, while BOW and PseAAC extract protein sequence information based on physical and chemical properties. In addition, we used the SR balancer to process the unbalanced data set and reduce the negative impact of the imbalance on the model. Finally, Random Forest was used to make predictions. For predicting SNO sites, CKSAAP and TPC were used to extract protein fragment information. To improve computational efficiency and eliminate the redundancy and noise generated by the fused features, we used the elastic net to reduce their dimensionality. These procedures require only computational models and no physical or chemical experiments, which saves experimental costs and improves work efficiency. We hope that this work will be helpful for solving biological problems with computational methods.
This work was supported by grants from the National Natural Science Foundation of China (No. 31760315, 62162032, 61761023) and the Natural Science Foundation of Jiangxi Province, China (No. 20202BAB202007).
The authors have declared that no competing interest exists.
[1] | Rodríguez-Félix A, Fortiz-Hernández J, Tortoledo-Ortiz O (2019) Physico-chemical characteristics, and bioactive compounds of red fruits of sweet pitaya (Stenocereus thurberi). JPACD 21: 87–100. |
[2] |
Paśko P, Galanty A, Zagrodzki P, et al. (2021) Bioactivity and cytotoxicity of different species of pitaya fruits–A comparative study with advanced chemometric analysis. Food Biosci 40: 100888. https://doi.org/10.1016/j.fbio.2021.100888 doi: 10.1016/j.fbio.2021.100888
![]() |
[3] | Quiroz-González B, García-Mateos R, Corrales-García GGE, et al. (2018) Pitaya (Stenocereus spp.): An under-utilized fruit. JPACD 20: 82–100. |
[4] |
García-Cruz L, Valle-Guadarrama S, Guerra-Ramírez D, et al. (2022) Cultivation, quality attributes, postharvest behavior, bioactive compounds, and uses of Stenocereus: A review. Sci Hortic 304: 111336. https://doi.org/10.1016/J.SCIENTA.2022.111336 doi: 10.1016/J.SCIENTA.2022.111336
![]() |
[5] |
Lim HK, Tan CP, Karim R, et al. (2010) Chemical composition and DSC thermal properties of two species of Hylocereus cacti seed oil: Hylocereus undatus and Hylocereus polyrhizus. Food Chem 119: 1326–1331. https://doi.org/10.1016/j.foodchem.2009.09.002 doi: 10.1016/j.foodchem.2009.09.002
![]() |
[6] |
Liu Y, Tu X, Li Y, et al. (2022) Analysis of lipids in pitaya seed oil by ultra-performance liquid chromatography–time-of-flight tandem mass spectrometry. Foods 11: 2988. https://doi.org/10.3390/foods11192988 doi: 10.3390/foods11192988
![]() |
[7] |
Canfora EE, Jocken JW, Blaak EE (2015) Short-chain fatty acids in control of body weight and insulin sensitivity. Nat Rev Endocrinol 11: 577–591. https://doi.org/10.1038/nrendo.2015.128 doi: 10.1038/nrendo.2015.128
![]() |
[8] |
Villalobos-Gutiérrez MG, Schweiggert RM, Carle R, et al. (2012) Chemical characterization of Central American pitaya (Hylocereus sp.) seeds and seed oil. CYTA-J Food 10: 78–83. https://doi.org/10.1080/19476337.2011.580063 doi: 10.1080/19476337.2011.580063
![]() |
[9] |
Ariffin AA, Bakar J, Tan CP, et al. (2009). Essential fatty acids of pitaya (dragon fruit) seed oil. Food Chem 114: 561–564. https://doi.org/10.1016/j.foodchem.2008.09.108 doi: 10.1016/j.foodchem.2008.09.108
![]() |
[10] | Abdullah A, Gani SSA, Mokhtar NFM, et al. (2018) Supercritical carbon dioxide extraction of red pitaya (Hylocereus polyrhizus) seeds: Response surface optimization, fatty acid composition and physicochemical properties. Malays Appl Biol 47: 39–46. |
[11] | AOAC (1998) Official Methods of Analysis of the Association of Official Analytical Chemists, 15 Eds., Washington: Association of Official Analytical Chemists, Inc. |
[12] |
Snyder LR (1974) Classification of the solvent properties of common liquids. J Chromatogr A 92: 223–230. https://doi.org/10.1016/S0021-9673(00)85732-5 doi: 10.1016/S0021-9673(00)85732-5
![]() |
[13] |
Neder-Suárez D, Lardizabal-Gutiérrez D, Zazueta-Morales J de J, et al. (2021) Anthocyanins and functional compounds change in a third-generation snacks prepared using extruded blue maize, black bean, and chard: An optimization. Antioxidants 10: 1368. https://doi.org/10.3390/antiox10091368 doi: 10.3390/antiox10091368
![]() |
[14] |
Rivera-Rangel LR, Aguilera-Campos KI, García-Triana A, et al. (2018) Comparison of oil content and fatty acids profile of Western Schley, Wichita, and Native pecan nuts cultured in Chihuahua, Mexico. J Lipids 2018: 4781345. https://doi.org/10.1155/2018/4781345 doi: 10.1155/2018/4781345
![]() |
[15] | Zulkifli SA, Gani A, Zaidan SS, et al. (2020) Optimization of total phenolic and flavonoid contents of defatted pitaya (Hylocereus polyrhizus) seed extract and its antioxidant properties. Molecules 25: 787. https://doi.org/10.3390/molecules25040787 |
[16] | Shi F, Jiang ZB, Xu J, et al. (2022) Optimized extraction of phenolic antioxidants from red pitaya (Hylocereus polyrhizus) seeds by subcritical water extraction using response surface methodology. Food Measure 16: 2240–2258. https://doi.org/10.1007/s11694-021-01212-1 |
[17] | Cao W, Wang Y, Shehzad Q, et al. (2022) Effect of different solvents on the extraction of oil from peony seeds (Paeonia suffruticosa Andr.): Oil yield, fatty acids composition, minor components, and antioxidant capacity. J Oleo Sci 71: 333–342. https://doi.org/10.5650/jos.ess21274 |
[18] | Alrashidi M, Derawi D, Salimona J, et al. (2022) The effects of different extraction solvents on the yield and antioxidant properties of Nigella sativa oil from Saudi Arabia. J Taibah Univ Sci 16: 330–336. https://doi.org/10.1080/16583655.2022.2057673 |
[19] | Wang Y, Su Y, Shehzad Q, et al. (2023) Comparative study on quality characteristics of Bischofia polycarpa seed oil by different solvents: Lipid composition, phytochemicals, and antioxidant activity. Food Chem X 17: 100588. https://doi.org/10.1016/j.fochx.2023.100588 |
[20] | Abdullah F, Ismail R, Shehzad Q, et al. (2018) Total phenolic contents and antioxidant activity of palm oils and palm kernel oils at various refining processes. J Oil Palm Res 30: 682–692. |
[21] | Giuffrè AM, Zappia C, Capocasale M (2017) Tomato seed oil: A comparison of extraction systems and solvents on its biodiesel and edible properties. Rivista Italiana Delle Sostanze Grasse 94: 149–160. |
[22] | Li Y, Fine F, Fabiano-Tixier AS, et al. (2014) Evaluation of alternative solvents for improvement of oil extraction from rapeseeds. CR Chim 17: 242–251. https://doi.org/10.1016/j.crci.2013.09.002 |
[23] | Kimura ET, Ebert DM, Dodge PW (1971) Acute toxicity and limits of solvent residue for sixteen organic solvents. Toxicol Appl Pharmacol 19: 699–704. https://doi.org/10.1016/0041-008X(71)90301-2 |
[24] | Ünver A (2023) Antioxidant properties, oxidative stability, and fatty acid profile of pitaya fruit (Hylocereus polyrhizus and Hylocereus undatus) seeds cultivated in Turkey. BioResources 18: 3342–3356. https://doi.org/10.15376/biores.18.2.3342-3356 |
[25] | Al Juhaimi F, Uslu N, Babiker EE, et al. (2019) The effect of different solvent types and extraction methods on oil yields and fatty acid composition of safflower seed. J Oleo Sci 68: 1099–1104. https://doi.org/10.5650/jos.ess19131 |
[26] | Tir R, Dutta PC, Badjah-Hadj-Ahmed AY, et al. (2012) Effect of the extraction solvent polarity on the sesame seeds oil composition. Eur J Lipid Sci Technol 114: 1427–1438. https://doi.org/10.1002/ejlt.201200129 |
[27] | Špika MJ, Perica S, Žanetić M, et al. (2021) Virgin olive oil phenols, fatty acid composition and sensory profile: Can cultivar overpower environmental and ripening effect? Antioxidants 10: 689. https://doi.org/10.3390/antiox10050689 |
[28] | Vannice G, Rasmussen H (2014) Position of the academy of nutrition and dietetics: Dietary fatty acids for healthy adults. J Acad Nutr Diet 114: 136–153. https://doi.org/10.1016/j.jand.2013.11.001 |
[29] | Naghshineh M, Ariffin AA, Ghazali HM, et al. (2010) Effect of saturated/unsaturated fatty acid ratio on physicochemical properties of palm olein-olive oil blend. J Am Oil Chem Soc 87: 255–262. https://doi.org/10.1007/s11746-009-1495-z |
[30] | Brinkmann B (2000) Quality criteria of industrial frying oils and fats. Eur J Lipid Sci Technol 102: 539–541. https://doi.org/10.1002/1438-9312(200009)102:8/9<539::aid-ejlt539>3.0.co;2-b |
[31] | Bansal G, Zhou W, Barlow PJ, et al. (2010) Review of rapid tests available for measuring the quality changes in frying oils and comparison with standard methods. Crit Rev Food Sci Nutr 50: 503–514. https://doi.org/10.1080/10408390802544611 |
[32] | Martínez C, Jiménez A, Garrigós MC, et al. (2023) Oxidative stability of avocado snacks formulated with olive extract as an active ingredient for novel food production. Foods 12: 2382. https://doi.org/10.3390/foods12122382 |
[33] | Kuligowski J, Carrión D, Quintás G, et al. (2011) Sample classification for improved performance of PLS models applied to the quality control of deep-frying oils of different botanic origins analyzed using ATR-FTIR spectroscopy. Anal Bioanal Chem 399: 1305–1314. https://doi.org/10.1007/s00216-010-4457-2 |
[34] | Zhang N, Yuan Y, Wang X, et al. (2013) Preparation and characterization of lauric–myristic–palmitic acid ternary eutectic mixtures/expanded graphite composite phase change material for thermal energy storage. Chem Eng J 231: 214–219. https://doi.org/10.1016/J.CEJ.2013.07.008 |
[35] | Castorena-García JH, Rojas-López M, Delgado-Macuil R, et al. (2011) Análisis de pulpa y aceite de aguacate con espectroscopia infrarroja. Conciencia Tecnológica 42: 5–10. |