
Biological sequence analysis is a fundamental research task in bioinformatics. With the explosive growth of data, machine learning methods play an increasingly important role in biological sequence analysis. A classifier is constructed to predict and evaluate input sequence feature vectors, so that knowledge of gene structure, function and evolution can be obtained from large amounts of sequence information, laying a foundation for in-depth research. At present, many machine learning methods have been applied to biological sequence analysis tasks such as RNA gene recognition and protein secondary structure prediction. RNA plays an important biological role in the encoding, decoding, regulation and expression of genes. RNA data are currently analyzed in terms of structure and function, including secondary structure prediction, non-coding RNA identification and functional site prediction. Pseudouridine (Ψ) is the most widespread and abundant RNA modification and has been discovered in a variety of RNAs. Accurately identifying Ψ sites in RNA sequences is essential for the study of related functional mechanisms and for disease diagnosis. Several computational approaches have been proposed as alternatives to experimental methods for detecting Ψ sites, but their performance can still be improved. In this study, we present a model based on the twin support vector machine (TWSVM) for Ψ site identification. The model combines a variety of feature representation techniques and uses the max-relevance and min-redundancy method to obtain the optimal feature subset for training. The independent testing accuracy is improved by 3.4% in comparison to current advanced Ψ site predictors. The results demonstrate that our model has better generalization performance and improves the accuracy of Ψ site identification.
iPseU-TWSVM can be a helpful tool to identify Ψ sites.
Citation: Mingshuai Chen, Xin Zhang, Ying Ju, Qing Liu, Yijie Ding. iPseU-TWSVM: Identification of RNA pseudouridine sites based on TWSVM[J]. Mathematical Biosciences and Engineering, 2022, 19(12): 13829-13850. doi: 10.3934/mbe.2022644
Pandemics caused by infectious diseases are becoming a constant threat in our globalized society. There are seasonal diseases like influenza, but also new diseases, often of zoonotic origin, like the recent case of COVID-19, caused by the coronavirus SARS-CoV-2. COVID-19 became a pandemic in 2020 and has left an important legacy in the form of extensive data covering various aspects relevant to the diffusion of the infection. During the COVID-19 pandemic, traditional compartmental modeling of infections, based on ordinary differential equations (ODEs), was employed very successfully in describing the evolution of the incidence of infections. More advanced versions of the traditional models have been proposed and tested, taking advantage of the unprecedented data availability. Data on people's mobility and behavior, on the adoption of public measures, and on economic restrictions and public policy were also fundamental in trying to make sense of the pandemic as it happened. In hindsight, one of the major open questions on infectious disease spread is how best to use the available data in transmission models so that new insights can be uncovered and new lessons drawn from our common recent COVID-19 experience, should we again be faced with similar situations.
Compartmental models are based on pioneering work in the early 20th century [1]. The first models were based on three compartments: susceptible (S), infectious (I) and removed (R), leading to the famous SIR model. More recently, the exposed (E) compartment has been added to take into account the latency of the disease, leading to SEIR models. Such models are flexible enough to allow for interactions among separate geographical regions or for age stratification. In the former case, each region has its own set of SEIR-type ODEs, with connecting terms that reflect the importation of cases from other regions. From the administrative point of view, the regional division of the population (for example, counties in the USA or public health regions in the province of Ontario, Canada) is based on various socio-demographic criteria, history, etc. The established literature on disease transmission has treated regions as cities, with commuter traffic and/or regular travel between them [2,3,4].
In general, SEIR-type models work well for large, well-mixed and isolated populations within which infections can easily spread [5]. This requirement is often not met by administrative regional divisions. For instance, in the USA, counties are often sparsely populated or strongly connected to nearby counties by commuters; states, instead, may encompass disconnected local areas or feature substantial cross-state commuting. The main goal of our work is to recast small geographical units, like counties, into new well-mixed and isolated regions by use of available data on people's mobility. The new regions are defined by two criteria: mobility between the new regions is minimized, while each new region contains a well-mixed sub-population. Moreover, we introduce a notion of temporal stability of our clusters and use it to analyze the results over a 6-month time window. Clustering of populations is not a novel concept; researchers have studied clustering populations in order to improve government legislation, re-imagine municipal infrastructure, and observe the environmental impacts of carbon emissions [6,7]. That work is broadly focused on using individual mobility data along with other features to cluster populations based on the similarity of those features [8,9]. Additional research has sought to discover high-traffic areas within cities in order to understand traffic dynamics within regional populations [10]. Research into clustering regions based on interconnected mobility is, to our knowledge, less represented in the literature.
Nevertheless, mobility networks have been used in conjunction with SEIR-type models in order to capture the epidemiological dynamics of COVID-19 in urban populations [11,12,13]. Some of this research focused on individual city dynamics in order to capture the spread within these smaller regions [11], while other studies quantified the effect of mobility restrictions on the disease spread in Canada [12] and in other countries around the world [13]. At larger scales, mobility can be used to understand the spread of the disease across continents: for instance, the second wave of 2020 in Europe has been studied in [14], while the spread in the USA has been modeled at the census division level [15]. Many other cases have been studied in the literature and cannot be briefly summarized here; all of them face the need to find an appropriate characterization of sub-populations.
For our purpose, we will employ a simple machine learning approach to define the new regions, based on the criteria and datasets mentioned above. There are three general approaches to classification problems: supervised, semi-supervised, and unsupervised [16,17]. In this article, we focus on unsupervised classification, herein referred to as clustering [17]. In clustering, nothing is known about the true classification of the data, unlike in supervised and semi-supervised learning, where all or some of the information about the true classification is known [17]. Clustering assigns classes to objects in a dataset [17]. Many different clustering algorithms exist and have a broad scope of applications, although no single clustering method is superior in all situations [17]. Clustering algorithms vary in methodology and applications, as described elsewhere [16,17,18]. Our methodology most resembles a type of fuzzy clustering, with distinct differences. Fuzzy clustering algorithms assign to each object a probability of belonging to each cluster [18]. As a result, the membership of a data point is shared among all clusters and the boundaries of the clusters become fuzzy [16]. In general, these approaches aim to minimize a cost function and reach a local minimum [16]. Our algorithm likewise minimizes a cost function and defines the membership of each cluster as a probability. The difference is that our algorithm uses the maximum probability to define the clusters after training.
The structure of the paper is as follows: In Section 2 we introduce the clustering algorithm adopted in this work; in Section 3 we apply the algorithm to the USA counties in early 2020, focusing on the stability over the adoption of uneven measures; additionally in Section 3 we discuss the epidemiological implication of the adoption of the novel sub-populations; we finally offer our conclusions in Section 4.
Part of the data that support these findings (USA COVID surveillance) is publicly available through the New York Times [19]. The other part of the data (USA contact rates) is available from the data vendor Cuebiq, but restrictions apply to the availability of these data, which were used under license for the current study and so are not publicly available. Data are, however, available from the authors upon reasonable request and with the permission of Cuebiq. The mobility data from Cuebiq were provided as weekly snapshots from January to June 2020. Several counties had missing mobility data; a list of these counties can be found in Table A1 of the supplementary material. The Google mobility data and the Mobility Census data from Ontario are publicly available [20,21].
Gradient descent learning was first proposed by Louis-Augustin Cauchy in 1847 [22] as an optimization algorithm suited to solving systems of coupled differential equations. Gradient descent algorithms minimize some objective function with respect to its variables [23]. This is done by calculating the gradient of the objective function and taking a small step in the opposite (decreasing) direction of the gradient. The step size is controlled by a learning rate, which is, in general, rather small. Recently, these algorithms have been used to optimize neural networks [23].
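The update described above can be illustrated with a minimal sketch. The objective, the starting point and the step size below are hypothetical choices made only for illustration; they are not taken from this work.

```python
import numpy as np

def gradient_descent(grad_fn, x0, step_size=0.1, n_steps=200):
    """Repeatedly take a small step against the gradient, starting from x0."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - step_size * grad_fn(x)
    return x

# Hypothetical objective f(x) = (x_0 - 3)^2 + (x_1 + 1)^2, minimized at (3, -1).
def grad_f(x):
    return 2.0 * (x - np.array([3.0, -1.0]))

x_min = gradient_descent(grad_f, x0=[0.0, 0.0])
```

With this step size, each iteration shrinks the distance to the minimum by a constant factor, so the iterate approaches (3, -1).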
Here, we apply this class of algorithm to a network formed by small geographical units (i.e., counties) that are connected by people's mobility among them. The main goal is to define a set of macro-regions formed by clusters of the network nodes, which are maximally connected inside each cluster and minimally connected to other clusters.
We first define an arbitrary number of clusters, denoted by $N_{\text{clusters}}$, under the assumption that this number is much smaller than the number of nodes in the network, $N_{\text{clusters}} \ll N_{\text{nodes}}$. Clusters can be initialized randomly or using some intuitive initialization. The algorithm then reorganizes the nodes into the desired number of clusters using spatial mobility data over a given time horizon. The clusters have no fixed size: in principle, a cluster can contain anywhere from no nodes to all nodes. In practice, clusters never contain a large portion of the total number of nodes. We define a matrix of probabilities
$$P \in \mathbb{R}^{N_{\text{nodes}} \times N_{\text{clusters}}}, \tag{2.1}$$
where each element $P_{ic} \in [0,1]$ denotes the probability that node $i$ belongs to cluster $c$. By conservation of probability, each row respects the following sum rule:
$$\sum_{c=1}^{N_{\text{clusters}}} P_{ic} = 1. \tag{2.2}$$
The clustering algorithm must find the optimal probability matrix P following a loss function, which is defined based on the desired properties of the clusters.
By using continuous probabilities instead of Boolean assignments, we can define a differentiable objective function that can be optimized with gradient descent: successive small improvements can be made to the assignments, rather than having abrupt changes when nodes are reassigned. After training, we deterministically assign each node to its maximum-probability cluster to obtain the final node-to-cluster assignment. Batch gradient descent is guaranteed to converge to the global minimum on convex surfaces and to a local minimum on non-convex surfaces [23].
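As a small illustration of the sum rule in Eq (2.2) and of the final deterministic assignment, consider the following sketch. The probability values are hypothetical and chosen only to make the example readable.

```python
import numpy as np

# Hypothetical probability matrix for 4 nodes and 3 clusters.
P = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
])

# Sum rule (2.2): every row is a probability distribution over clusters.
row_sums = P.sum(axis=1)

# Deterministic assignment: each node goes to its maximum-probability cluster.
assignment = P.argmax(axis=1)
```

Here `assignment` holds one cluster index per node, which is the hard clustering used after training.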
The optimization of the node-to-cluster assignment is based on a loss function that depends on the probability matrix P. It measures how accurately any value of P matches the required features of the clusters. This loss function should evaluate to a large value when we have an inaccurate solution, but close to zero when we have an accurate node-to-cluster assignment solution.
To fulfill our purposes, the loss function must have two parts, as we need to control two quantities: the mobility interactions among clusters and the population differences among clusters, both of which should be low. Hence, we define the loss function as a convex combination of two terms:
$$\text{Loss}_{\text{Total}} = \alpha_{\text{Int}}\,\text{Loss}_{\text{Int}} + \alpha_{\text{Pop}}\,\text{Loss}_{\text{Pop}}, \tag{2.3}$$
where the constants $\alpha_{\text{Int}}, \alpha_{\text{Pop}} \in \mathbb{R}^{+}$, with $\alpha_{\text{Int}} + \alpha_{\text{Pop}} = 1$, control the relative strength of the two requirements. Once the weights are fixed, we employ a gradient descent method to minimize the loss function and find the optimal probability matrix $P$. The weights were chosen by analyzing the clustering results over a range of admissible values of $\alpha_{\text{Int}}$ and $\alpha_{\text{Pop}}$; this analysis is presented in the Appendix.
The low mobility interaction term $\text{Loss}_{\text{Int}}$ takes into account the interactions between clusters, measured in terms of the population mobility among the nodes of the network. Hence, this measure relies on mobility data, expressed in terms of an interaction matrix, $\text{Interaction}_{ij}$, whose elements are proportional to people's flow from node $i$ to node $j$. The loss function sums the interactions among all nodes belonging to different clusters, and it is defined as follows:
$$\text{Loss}_{\text{Int}} := \sum_{i,j=1}^{N_{\text{nodes}}} \text{Interaction}_{ij}\, P_{\text{different}}(i,j), \tag{2.4}$$
with $P_{\text{different}}(i,j)$ being the probability that node $i$ is in a different cluster than node $j$. By the definition of the probability matrix $P$, we have
$$P_{\text{different}}(i,j) = 1 - \left[P P^{T}\right]_{ij}, \tag{2.5}$$
hence the loss function can be written in terms of matrix operations as:
$$\text{Loss}_{\text{Int}} = \text{Tr}\left(\text{Interaction}\left(\mathbf{1} - P P^{T}\right)\right), \tag{2.6}$$
with $\mathbf{1}$ being defined as a matrix of ones.
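The equivalence between the double sum in Eq (2.4) and the trace form in Eq (2.6) can be checked numerically. The sketch below uses NumPy and randomly generated matrices; the function names, the sizes and the random data are assumptions made purely for illustration.

```python
import numpy as np

def loss_int_sum(interaction, P):
    """Eq (2.4): explicit double sum over all pairs of nodes."""
    n_nodes = P.shape[0]
    total = 0.0
    for i in range(n_nodes):
        for j in range(n_nodes):
            p_different = 1.0 - P[i] @ P[j]   # Eq (2.5)
            total += interaction[i, j] * p_different
    return total

def loss_int_trace(interaction, P):
    """Eq (2.6): the same quantity written with matrix operations."""
    ones = np.ones_like(interaction)
    return np.trace(interaction @ (ones - P @ P.T))

# Hypothetical data: 5 nodes, 3 clusters.
rng = np.random.default_rng(0)
interaction = rng.random((5, 5))
P = rng.random((5, 3))
P = P / P.sum(axis=1, keepdims=True)   # enforce the sum rule (2.2)

value_sum = loss_int_sum(interaction, P)
value_trace = loss_int_trace(interaction, P)

# Trivial solution: all nodes in one cluster gives zero interaction loss.
P_single = np.zeros((5, 3))
P_single[:, 0] = 1.0
value_single = loss_int_trace(interaction, P_single)
```

The two forms agree because $\mathbf{1} - P P^{T}$ is symmetric. The last lines show the trivial single-cluster solution, for which the interaction loss vanishes.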
The low population difference term $\text{Loss}_{\text{Pop}}$ forces the solution to contain clusters of approximately equal population. The main purpose of this term is to force the algorithm away from a trivial solution where all nodes are joined in a single giant cluster while the other clusters are left empty, which trivially minimizes the inter-cluster interactions. We define this term as follows:
$$\text{Loss}_{\text{Pop}} := \sum_{c=1}^{N_{\text{clusters}}} \left( E_{P}[\text{Population of cluster } c] - \frac{\text{TotalPop}}{N_{\text{clusters}}} \right)^{2}, \tag{2.7}$$
where $E_{P}[\text{Population of cluster } c]$ denotes the expected population of cluster $c$ based on the node assignment given by the probabilities $P_{ic}$, and $\text{TotalPop}$ is the total population in the network. Defining a vector of the node populations $\text{Pop}_{i}$, we have
$$E_{P}[\text{Population of cluster } c] = \sum_{i=1}^{N_{\text{nodes}}} P_{ic}\, \text{Pop}_{i} = \left(P^{T} \text{Pop}\right)_{c}. \tag{2.8}$$
Hence, the loss function can be written in terms of matrix operations as:
$$\text{Loss}_{\text{Pop}} = \left| P^{T} \text{Pop} - \frac{\text{TotalPop}}{N_{\text{clusters}}} \right|^{2}. \tag{2.9}$$
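To make Eqs (2.7) to (2.9) concrete, the following sketch evaluates the population term for two hard assignments of a hypothetical four-node network. The populations and the function name are illustrative assumptions.

```python
import numpy as np

def loss_pop(P, pop):
    """Eq (2.9): squared deviation of expected cluster populations from the mean."""
    n_clusters = P.shape[1]
    expected = P.T @ pop                  # Eq (2.8): expected population per cluster
    target = pop.sum() / n_clusters       # TotalPop / N_clusters
    return np.sum((expected - target) ** 2)

# Hypothetical node populations (total 1000) and two clusters.
pop = np.array([100.0, 300.0, 200.0, 400.0])

# Balanced assignment: clusters hold 500 and 500 people.
P_balanced = np.array([[1, 0], [0, 1], [0, 1], [1, 0]], dtype=float)

# Trivial assignment: one giant cluster with 1000 people, one empty cluster.
P_single = np.array([[1, 0], [1, 0], [1, 0], [1, 0]], dtype=float)

loss_balanced = loss_pop(P_balanced, pop)
loss_single = loss_pop(P_single, pop)
```

The balanced assignment gives zero loss, while the single giant cluster is penalized with (1000 - 500)^2 + (0 - 500)^2 = 500000, which is how this term steers the optimization away from the trivial solution.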
There is a subtle aspect in applying a gradient descent algorithm to our problem. As the variables $P$ must respect $P_{ic} \in [0,1]$ and $\sum_{c=1}^{N_{\text{clusters}}} P_{ic} = 1$, plain gradient descent is not ideal: it would have to be applied to a constrained optimization problem with both box and linear constraints. Hence, to simplify the implementation, we apply the algorithm to a parametric matrix $X \in \mathbb{R}^{N_{\text{nodes}} \times N_{\text{clusters}}}$, where each entry $X_{ic} \in (-\infty, \infty)$ is unconstrained. In order to obtain $P$ from $X$, each real-valued row must be converted into probabilities. A common approach in neural networks and deep learning is to use the softmax function [24,25]. Each row of the matrix $P$ is then defined as:
$$P_{ic} = \frac{e^{X_{ic}}}{\sum_{c'=1}^{N_{\text{clusters}}} e^{X_{ic'}}}. \tag{2.10}$$
This reparametrization is common in machine learning. To find a good cluster assignment, the parameters were updated at every step according to the following rule, with the gradient computed by automatic differentiation using the grad function of the JAX library in Python:
$$X_{\text{new}} = X_{\text{old}} - \text{StepSize}\, \nabla_{X} \text{Loss}(X). \tag{2.11}$$
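The softmax reparametrization of Eq (2.10) and the update of Eq (2.11) can be sketched as follows. This is only an illustration under stated assumptions: it uses NumPy with a hand-derived gradient in place of JAX automatic differentiation, and the network size, data, weights and step size are hypothetical.

```python
import numpy as np

def softmax_rows(X):
    """Eq (2.10), with the row maximum subtracted for numerical stability."""
    Z = np.exp(X - X.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

def total_loss(X, interaction, pop, a_int, a_pop):
    """Eq (2.3) evaluated at P = softmax(X)."""
    P = softmax_rows(X)
    loss_int = np.sum(interaction * (1.0 - P @ P.T))     # Eq (2.4)
    resid = P.T @ pop - pop.sum() / P.shape[1]
    loss_pop = np.sum(resid ** 2)                        # Eq (2.9)
    return a_int * loss_int + a_pop * loss_pop

def total_loss_grad(X, interaction, pop, a_int, a_pop):
    """Hand-derived gradient with respect to X (chain rule through softmax)."""
    P = softmax_rows(X)
    resid = P.T @ pop - pop.sum() / P.shape[1]
    # Gradient of the total loss with respect to P.
    G = (-a_int * (interaction + interaction.T) @ P
         + 2.0 * a_pop * np.outer(pop, resid))
    # Apply the softmax Jacobian row by row.
    return P * (G - np.sum(P * G, axis=1, keepdims=True))

def gradient_step(X, step_size, *loss_args):
    """Eq (2.11): one gradient descent update of the unconstrained matrix X."""
    return X - step_size * total_loss_grad(X, *loss_args)

# Hypothetical problem: 6 nodes, 3 clusters.
rng = np.random.default_rng(0)
interaction = rng.random((6, 6))
pop = rng.random(6) + 0.5
X = rng.normal(size=(6, 3))
loss_args = (interaction, pop, 0.99, 0.01)

X_new = gradient_step(X, 0.01, *loss_args)
```

An automatic differentiation tool would produce the same gradient without the manual derivation. Because the rows of `softmax_rows(X)` always sum to one, the constraints on $P$ are satisfied at every step without any projection.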
We apply the clustering algorithm to a network made of 3102 counties and county equivalents, located within the 50 states and the District of Columbia (DC). In total, there are 3144 counties and county equivalents within the USA; however, 42 counties were missing from the Cuebiq dataset and are not included in our network. These missing counties appear in white on the maps of the USA, as seen for instance in Figure 1.
The mobility data were provided by Cuebiq and contain the number of users of their proprietary app traveling from county $i$ to county $j$, normalized by the number of users seen in county $i$. For each county, the data include the 15 largest flows to other counties on a weekly timescale; hence, the entries are largely dominated by commuters traveling among nearby counties. Air travelers, while not explicitly excluded, are far fewer than commuters and other ground-based travelers, and their flows often fall below the top-15 cut or remain subleading. We consider the data from Cuebiq users to be a good proxy for the total population of each county, as confirmed by the provider. Using these data, we constructed the 3102 by 3102 flow matrix used in Eq (2.4), whose entries represent the flow from county $i$ to county $j$ normalized by the population of the county of origin [26].
In order to speed up the convergence of the algorithm, the variable matrix $X$ was initialized using physical proximity among counties. To do this, we generated $N_{\text{clusters}}$ "initialization central points" (ICPs) spread across the USA, and then initialized the $X$-values for the $N_{\text{counties}}$ nodes proportionally to the distances from these points. The distances are computed between the geographical center of each county $i$ and the ICPs. Specifically, we set:
$$X_{ic} = -0.1\, \text{distance}(\text{Center of County } i, \text{ICP } c), \tag{2.12}$$
where $1 \le i \le N_{\text{counties}}$ is the county index and $1 \le c \le N_{\text{clusters}}$ is the cluster index.
In this way, at initialization, each county has the highest probability of belonging to the cluster of its nearest ICP. In practice, the ICPs are defined as a rectangular grid of equally spaced points covering the USA (including Alaska and Hawaii); see Figure 1(a) for an example with $N_{\text{clusters}} = 100$. At initialization only a subset of the clusters is populated; the remaining ones are populated later by the algorithm. For instance, the clusterings in Figure 1(b) and (c) were obtained for two different combinations of the loss function weights, after 50,000 gradient descent steps.
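The proximity-based initialization of Eq (2.12) can be sketched as below. The planar coordinates, the grid size and the use of Euclidean distance are assumptions made for illustration; the distance measure used on real county centers is not specified here.

```python
import numpy as np

def init_logits(county_centers, icps, scale=0.1):
    """Eq (2.12): X_ic = -scale * distance(center of county i, ICP c)."""
    diff = county_centers[:, None, :] - icps[None, :, :]
    dist = np.linalg.norm(diff, axis=2)   # Euclidean distance (an assumption)
    return -scale * dist

# Hypothetical rectangular grid of ICPs: 3 columns by 2 rows.
xs = np.linspace(0.0, 10.0, 3)
ys = np.linspace(0.0, 10.0, 2)
icps = np.array([(x, y) for y in ys for x in xs])

# Hypothetical county centers in the same planar coordinates.
county_centers = np.array([[0.5, 0.2], [9.0, 9.5], [5.2, 0.1]])

X0 = init_logits(county_centers, icps)

# Softmax is monotone in each row, so the largest X entry marks the
# most probable cluster, which is the cluster of the nearest ICP.
nearest = X0.argmax(axis=1)
```

In this toy grid, ICPs that have no county nearby start out empty, mirroring the remark that only a subset of clusters is populated at initialization.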
As can be seen in Figure 1(b), sizable values of $\alpha_{\text{Pop}}$ force the clusters to have similar populations, at the cost of heterogeneity in their geographical extension and of discontinuities. This is due to the very heterogeneous distribution of the population, with some counties densely populated and others very sparsely populated. As a result, the clustering in Figure 1(b) contains clusters consisting of a handful of counties near cities, in contrast to very extended and discontinuous ones in more rural areas. To adjust the outcome, we reduced the impact of the population requirement by choosing $\alpha_{\text{Pop}} = 0.01$ and $\alpha_{\text{Int}} = 0.99$. This resulted in the clustering in Figure 1(c), with more comparable and geographically continuous clusters. We then tested the algorithm with various numbers of clusters, obtaining comparable results.
In the remainder of this work we will focus on $N_{\text{clusters}} = 49$, which provides a number of clusters closely comparable to the number of states (50 + DC). We obtained the 49 clusters from pre-pandemic mobility levels, using data from the first week of January 2020. As a working point, we used skewed weights $\alpha_{\text{Int}} = 0.99$ and $\alpha_{\text{Pop}} = 0.01$ to minimize cluster-to-cluster mobility flows. The resulting clusters are visualized in Figure 2(a). To test the residual level of inter-cluster mobility, we computed the following matrix at the end of the algorithm run:
$$M_{\text{inter-cluster}} = X_{\text{out}}^{T}\, \text{Interaction}\, X_{\text{out}}, \tag{2.13}$$
where $X_{\text{out}}$ is the final value of the variable matrix $X$ used to define the cluster probabilities via Eq (2.10). The entries of this matrix are visualized in Figure 2(b).
The vast majority of the activity is detected on the diagonal, which represents the mobility among counties belonging to the same cluster. Off-diagonal entries, instead, feature very small values, indicating very limited mobility among counties belonging to different clusters. This result, therefore, validates the effectiveness of the clustering algorithm.
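The diagonal-dominance check can be illustrated with a toy example in the form of Eq (2.13). As a simplifying assumption, the sketch below uses a one-hot (hard) assignment matrix in place of $X_{\text{out}}$, so that each entry reads directly as the total flow from one cluster to another; the interaction values and labels are hypothetical.

```python
import numpy as np

def inter_cluster_matrix(A, interaction):
    """A^T Interaction A, following the form of Eq (2.13)."""
    return A.T @ interaction @ A

# Hypothetical 4-node network: nodes 0 and 1 interact strongly with each
# other, as do nodes 2 and 3, with only weak flows between the two pairs.
interaction = np.array([
    [0.0, 0.9, 0.1, 0.0],
    [0.8, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 0.7],
    [0.0, 0.1, 0.6, 0.0],
])

labels = np.array([0, 0, 1, 1])   # hard node-to-cluster assignment
A = np.eye(2)[labels]             # one-hot assignment matrix, shape (4, 2)

M = inter_cluster_matrix(A, interaction)
```

With this assignment, the diagonal entries of `M` collect the within-cluster flows (1.7 and 1.3), while the off-diagonal entries collect the much smaller between-cluster flows (0.2 each).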
In this section we review the results obtained from our model, exploring the stable clusters that were extracted from the mobility data obtained in the first 6 months of 2020. Using these "core clusters", we were able to show the temporal spread of COVID-19 from its origin to the entire USA in the first few weeks of 2020. It is clear to see that:
● The initial spread of the disease was due to case importation via flights: cases first appeared on the West Coast and then spread via flights from some of the largest airport hubs in the USA.
● In comparing the geographic clusters structure obtained from our algorithm with the USA States (census-based) structure, we highlight regions that had a disproportionately high number of COVID-19 cases. Some of these regions are large portions of a given census state, while others encompass two or more adjacent census states. This reveals that epidemiological data studied at state level is not always illuminating, and can be late in signaling cases rising at state level, while portions of a state could be already experiencing high transmission. Moreover, highlighting disease spread based on how people interact with their communities indicates that allocating public health resources to those areas could have a beneficial impact on the spread of future communicable disease outbreaks.
● Last but not least, we note that this model can be applied to regions beyond the USA using different (mobility, population, disease) data. As an example, we propose an application to Ontario, Canada, where data on commuters and Google mobility data are publicly available.
Over the course of a pandemic, people's mobility continually changes as different regions impose various non-pharmaceutical measures in order to counter the spread of the disease. During the early phases of COVID-19, some states/regions were placed in lockdown, while others had lighter restrictions. In principle, such changes could affect how counties are clustered by our algorithm via changes in the interaction matrix.
To test the stability of the algorithm results, we considered the first six months of 2020, which saw the first diffusion of the infections and the most severe travel and local mobility restrictions. During this period the majority of restrictions put in place by the USA government were to limit mobility [27]. Hence, we constructed six different clusterings using the mobility data from the first week of each month from January to June, 2020. We then use the output probability matrices Pα, where α labels the month, to check the stability of the output.
To do so, we first define a product matrix $M_{\alpha}$ which traces over the clusters:
$$M_{\alpha} = P_{\alpha} P_{\alpha}^{T}. \tag{3.1}$$
Each entry $M_{ij} = \sum_{c=1}^{N_{\text{clusters}}} P_{ic} P_{jc}$ measures how likely county $i$ is to belong to the same cluster as county $j$, as the product of the two probability rows is maximized if the two coincide. To express this similarity measure more objectively, we normalize each row of the matrix in Eq (3.1) as follows:
$$\bar{M}_{ij} = \frac{M_{ij}}{\sqrt{\sum_{k=1}^{N_{\text{nodes}}} M_{ik}^{2}}}, \quad \text{for } i \in [1, N_{\text{nodes}}]. \tag{3.2}$$
The similarity measures are expressed by the diagonal entries of the product matrix
$$S_{\alpha\beta} = \bar{M}_{\alpha} \bar{M}_{\beta}^{T}. \tag{3.3}$$
In fact, for $\alpha = \beta$, the diagonal entries are all equal to 1 thanks to Eq (3.2). For $\alpha \neq \beta$, the closeness to unity of the diagonal entries measures how similar the two clusterings $\alpha$ and $\beta$ are to each other. We then construct the similarity matrix $S$ for clusterings stemming from consecutive months from January to June 2020. The distribution of the diagonal entries for the five cases is shown in Figure 3.
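The similarity measure of Eqs (3.1) to (3.3) can be sketched as follows. The two hard clusterings below are hypothetical, and only the diagonal of $S_{\alpha\beta}$ is computed, since these are the entries used in the analysis.

```python
import numpy as np

def normalized_coassignment(P):
    """Eqs (3.1) and (3.2): M = P P^T with each row scaled to unit length."""
    M = P @ P.T
    norms = np.sqrt(np.sum(M ** 2, axis=1, keepdims=True))
    return M / norms

def clustering_similarity(P_a, P_b):
    """Diagonal of S = Mbar_a Mbar_b^T from Eq (3.3), one value per node."""
    Mbar_a = normalized_coassignment(P_a)
    Mbar_b = normalized_coassignment(P_b)
    return np.sum(Mbar_a * Mbar_b, axis=1)

# Hypothetical hard clusterings of 4 nodes into 2 clusters for two months.
P_jan = np.eye(2)[[0, 0, 1, 1]]
P_feb = np.eye(2)[[0, 0, 1, 0]]   # node 3 switches cluster

self_similarity = clustering_similarity(P_jan, P_jan)
cross_similarity = clustering_similarity(P_jan, P_feb)
```

Comparing a clustering with itself gives 1 for every node, while in the cross comparison the node that switched cluster receives the lowest value, here $1/\sqrt{6}$.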
These results show that the majority of counties remain in the same cluster over the 6-month period, as the majority of the entries remain very close to 1. This fact allows us to distinguish counties that consistently belong to the same cluster from counties that flip between different clusters (at least once). The pruning of unstable counties can be performed by keeping only counties whose diagonal $S$-entries remain above a given threshold over the 6-month period under study. We show in Figure 4 the result for three values of the threshold: 0.1, 0.2 and 0.5. The maps clearly show that unstable counties tend to be located within less populated areas, while densely populated areas remain stable.
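The pruning step can be sketched as a simple mask over nodes. The similarity values below are hypothetical, and the function name and the strict inequality are assumptions of this sketch.

```python
import numpy as np

def stable_nodes(similarities, threshold):
    """Return a boolean mask of nodes whose diagonal S-entries stay above
    the threshold for every pair of consecutive months.

    similarities: array of shape (n_pairs, n_nodes).
    """
    return np.all(np.asarray(similarities) > threshold, axis=0)

# Hypothetical diagonal S-entries for 2 month pairs and 4 counties.
sims = np.array([
    [0.99, 0.95, 0.40, 0.90],
    [0.98, 0.20, 0.60, 0.92],
])

mask = stable_nodes(sims, threshold=0.5)
```

Counties 1 and 2 each fall below the threshold in one of the comparisons and are pruned, while counties 0 and 3 are kept.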
Interestingly, pruning with a threshold of 0.5 agrees quite well with the rural-urban map of the USA presented in [28], which we include in Figure 5(b).
We find that pruning counties with a threshold of 0.25, as shown in Figure 4(b), is a good compromise for defining stable clusters, because we do not want to exclude all rural counties from our analysis. This pruning allows us to define 43 core clusters to be used for further analyses in the following section (6 clusters being emptied). We can also check the effect of the population constraint imposed in the clustering algorithm. In Figure 6 we compare the distribution of population in the 50 + DC states and in the 43 core clusters remaining after the pruning. One can see that several core clusters have a population of around 8 million, while state populations are concentrated around smaller values, with a few very populous exceptions. This shows that small weights such as $\alpha_{\text{Pop}} = 0.01$ should be sufficient to obtain clusters with fairly balanced populations.
COVID-19 incidence data are available at the county level in the USA. However, the small populations of many counties yield statistically poor data, which do not allow growth patterns to be clearly identified. Hence, it is important to aggregate the data into larger geographical units. By applying the clustering algorithm, we aimed to identify well-mixed regional sub-populations. Having achieved this goal mathematically, we now turn to the analysis of the epidemiological data for the newly formed sub-populations in the core clusters. Compared to a state-level analysis, our approach offers a new view of the initial COVID-19 spread, indirectly highlighting the importance of flights and case importations. As a consequence, our results call for more localized preventive measures in large states (such as California) and for coordination/cooperation among neighboring states in staving off disease spread.
The algorithmic definition of core clusters also provides a new view on the initial geographical spread of COVID-19 cases within the USA. In particular, it shows how the disease spreads geographically to the most populated areas, which are also served by airport hubs. Such hubs have been shown to play a crucial role in the diffusion of airborne diseases in many regions of the world (see for instance [15,29]). To illustrate this, we plot a timeline of the initial spread of COVID-19 from February 24 until March 16, 2020, hence over a 4-week period. In each geographical unit, we identify the time of transition from a disease-free state to the initial exponential increase in the incidence numbers. In Figure 7 we show the progression of the disease within core clusters (left panels) and states (right panels): each geographical unit is colored in red in the week when the exponential increase is first detected, and it turns green from the following week.
The difference between the two columns is telling. In the core cluster analysis, we see that the disease started in two clusters located in northern California and western Nevada. During the second week, there was a spread to nearby clusters in the west (Oregon, Seattle-area and Iowa) as well as to major airport hubs in the central and eastern part of the USA. We can easily identify isolated red clusters around Boston, New York, Washington, Chicago, Atlanta, Houston and Los Angeles in panel (a2) of Figure 7. During the third week, the disease reaches the remainder of the territory, except for areas in Washington/Montana states and Alaska, which turn red during the fourth week. The corresponding state-level analysis, shown in the right panels, features a similar overall pattern; however, important details are missing or diluted. In particular, the importance of airports, which is indirectly highlighted in the core cluster analysis, is missing. Instead, the core cluster analysis confirms the results obtained, for instance, in [15], where an analysis of data aggregated at census division level provided evidence that air traffic was the principal culprit for the spread of COVID-19 from California to the rest of the country.
Furthermore, the clustering algorithm reveals specific features of the disease spread in local areas within some large states and also across state borders. For instance, one can see that the disease starts effectively spreading in Texas near Houston and at the southern tip during the second week, while at state level one would conclude that Texas was one of the first affected states. These specific differences could inform policy, and could give decision makers and public health officials more tools to act on preventive measures very early on, rather than being guided by state-level views.
For visual confirmation of our results showing the importance of case importation by air, in Figure 8 we place panel (a2) of Figure 7 side by side with the map of enplanements at the top 50 airports in the USA, courtesy of the USA Bureau of Transportation Statistics [30]. In the upper panel we see the red spots corresponding to airline hubs, while in the lower panel we see the flight density (represented by the size of each bubble) around the top 50 airports in the USA.
In this section we show that comparing transmission quantifiers in the core cluster geography with those in the census-based geography can be revealing. For instance, we make the case below that state-level incidence reporting may be late in signaling transmission within a state, when state-level data overlooks hot transmission zones in a state that are due to its adjacency with a different state. Moreover, core clusters overlapping two or more states identify high-transmission areas that may need coordinated policy and resource allocation.
Starting from incidence data provided by the New York Times at the county level, we aggregated the data both at state and core cluster level. Our aim is to study the growth of the number of infections and compare different regions. We assumed that, at the beginning of a wave of infections, the incidence number inc(t) is given by an exponential curve of the type:
$$\mathrm{inc}(t) = \mathrm{inc}(0)\, e^{\rho t}. \tag{3.4}$$
This behavior would signal the phase of exponential growth in the infections and the start of an epidemiological wave. Hence, we can compute a time-series of the exponential growth factor:
$$\rho = \ln\frac{\mathrm{inc}(t+1)}{\mathrm{inc}(t)}, \quad \text{with } \mathrm{inc}(t) \neq 0. \tag{3.5}$$
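As a minimal sketch of Eq (3.5), the time series of growth factors can be computed from a list of daily incidence numbers as follows. The function name `growth_factors` and the sample data are illustrative assumptions, not part of the original analysis; days with zero incidence are skipped because the logarithm is undefined there.

```python
import math

def growth_factors(incidence):
    """Return rho(t) = ln(inc(t+1) / inc(t)) for consecutive time steps.

    Entries where inc(t) or inc(t+1) is zero are returned as None,
    since the logarithm of the ratio is undefined in that case.
    """
    rhos = []
    for today, tomorrow in zip(incidence, incidence[1:]):
        if today > 0 and tomorrow > 0:
            rhos.append(math.log(tomorrow / today))
        else:
            rhos.append(None)
    return rhos

# Made-up incidence that doubles every day once cases appear.
daily_cases = [0, 2, 4, 8, 16]
print(growth_factors(daily_cases))
```

For incidence that doubles at every step, each defined entry equals ln 2, which is the constant ρ expected during a purely exponential phase.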
The growth factor ρ is computed from the initial phase of nearly exponential growth in the neighborhood of the disease-free equilibrium state, corresponding to a phase of linear growth in log(inc) with slope ρ. Using the growth factor estimates as above and the closed-form formula for the initial reproduction number R0 based on ρ (see [31] for details)*, we obtain an estimate of R0 for each geographical unit. We identify the initial fastest phase of nearly unchecked growth in any given region with the help of a piecewise linear fit to the log of the incidence. We use the R function dpseg(), which is part of the dpseg package [32]. This function uses a dynamic programming algorithm to generate an optimal piecewise linear fit to a time series, which balances goodness of fit against an (adjustable) penalty for each additional segment. We then identify the earliest segment with the steepest positive slope (largest ρ) as corresponding to the initial near-unchecked exponential growth phase. We introduced this computation in our earlier paper [33], where we deduced the exponential growth regime for the USA in early 2020, as shown in Figure 9.
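The idea behind the segmentation step can be sketched in Python as below. This is not the dpseg implementation and makes no claim to match its scoring function or defaults; it is a simplified dynamic program, under the assumption that the cost of a segmentation is the total squared residual plus a fixed penalty per segment. The names `line_fit`, `piecewise_linear_fit`, `steepest_growth_segment`, the `min_len` parameter and the synthetic data are all hypothetical.

```python
import numpy as np

def line_fit(x, y):
    """Least-squares line; returns (slope, intercept, sum of squared residuals)."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return slope, intercept, float(residuals @ residuals)

def piecewise_linear_fit(x, y, penalty, min_len=3):
    """Split (x, y) into contiguous segments minimizing
    total squared residuals + penalty * (number of segments)."""
    n = len(x)
    if n < min_len:
        raise ValueError("need at least min_len points")
    best = np.full(n + 1, np.inf)        # best[j]: minimal cost for points 0..j-1
    best[0] = 0.0
    start = np.zeros(n + 1, dtype=int)   # start[j]: first index of the last segment
    for j in range(min_len, n + 1):
        for i in range(0, j - min_len + 1):
            if np.isinf(best[i]):
                continue                 # points 0..i-1 cannot be segmented
            cost = best[i] + line_fit(x[i:j], y[i:j])[2] + penalty
            if cost < best[j]:
                best[j], start[j] = cost, i
    # Backtrack from the last point to recover the segments in order.
    segments, j = [], n
    while j > 0:
        i = int(start[j])
        slope, _, _ = line_fit(x[i:j], y[i:j])
        segments.append((i, j, float(slope)))
        j = i
    return segments[::-1]

def steepest_growth_segment(segments):
    """Earliest segment with the largest positive slope, or None if there is none."""
    best = None
    for seg in segments:
        if seg[2] > 0 and (best is None or seg[2] > best[2]):
            best = seg
    return best

# Synthetic log-incidence: slow growth (slope 0.05), then fast growth (slope 0.3).
days = np.arange(20.0)
log_inc = np.where(days < 10, 0.05 * days, 0.45 + 0.3 * (days - 9))
segments = piecewise_linear_fit(days, log_inc, penalty=0.1)
print(segments)
print(steepest_growth_segment(segments))
```

The penalty plays the role described in the text: a larger value favors fewer, longer segments, while a smaller value lets the fit follow the data more closely. The slope of the selected segment is the estimate of ρ.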
*In a typical SEIR model, where σ is the rate at which exposed individuals become infectious and γ is the recovery rate, $R_0 = \dfrac{(\rho+\sigma)(\rho+\gamma)}{\sigma\gamma}$, where ρ is the exponential growth factor.
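The closed-form relation in the footnote, read as R0 = (ρ+σ)(ρ+γ)/(σγ), can be evaluated directly. The sketch below is only an illustration: the function name `r0_from_growth` and the numerical rates are hypothetical and are not the values used in the paper.

```python
def r0_from_growth(rho, sigma, gamma):
    """Initial reproduction number from the exponential growth factor rho,
    using R0 = (rho + sigma) * (rho + gamma) / (sigma * gamma),
    with sigma and gamma the SEIR rates named in the footnote."""
    if sigma <= 0 or gamma <= 0:
        raise ValueError("sigma and gamma must be positive")
    return (rho + sigma) * (rho + gamma) / (sigma * gamma)

# Hypothetical rates, chosen only to show the behavior of the formula.
print(r0_from_growth(rho=0.0, sigma=0.2, gamma=0.1))  # no growth
print(r0_from_growth(rho=0.1, sigma=0.2, gamma=0.1))  # positive growth
```

A vanishing growth factor gives R0 = 1, the epidemic threshold, and R0 increases with ρ, which is why the steepest segment of log(inc) corresponds to the largest reproduction number.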
We visualize the values of R0 obtained at state level in Figure 10(a) and by use of the core clusters in Figure 10(b). Here, the initial reproduction number R0 is computed over weeks 9 to 14 of 2020, that is the period from February 24th to March 16th, 2020. The results are fairly compatible; however, a visual comparison of the two maps shows that some counties are characterized by very different values.
To better visualize the difference between the two approaches - state vs. core clusters - for each county we computed the difference in the local R0 obtained by the two different aggregations:
$$\Delta R_0 = R_0(\text{core cluster}) - R_0(\text{state}), \tag{3.6}$$
and show the values in Figure 10(c), where the color gradient corresponds to the size of the difference. Areas in yellow highlight the biggest differences, meaning that the R0 values in those clusters were higher than the state-level data indicated. The darker blue areas indicate that the state-level data gave a higher value of R0 than the cluster-level data. From a policy perspective, the yellow areas are the important ones: looking strictly at the geographic state level, policymakers may feel optimistic about their state-wide R0 values when, in effect, due to people's mobility, the initial force of infection is much higher in many of their counties (see Figure 10(c)). The core cluster analysis, therefore, provides more reliable results for local communities living in a fraction of a state's counties, where localized measures could be implemented to limit the incidence of the disease.
In order to capture the state-level versus the core cluster based view of the population and the concurrent disease spread, we extracted a few sample states with their corresponding clusters. We observed three possible situations: ⅰ) a state is essentially its own cluster without interactions with other clusters (e.g., Alaska and Maine); ⅱ) a state contains several clusters (e.g., California and Florida), and ⅲ) a cluster overlaps more than one state land mass. Case ⅰ) illustrates a trivial equivalence between the two methods, hence in this section we focus the analysis on examples of Cases ⅱ) and ⅲ) above.
California is a clear example of Case ⅱ), see Figure 11(b), as the majority of its population and territory is contained within 4 core clusters: 41, 42, 39 and 18 (note that 18 also includes one county from Arizona). In Figure 11(a) we show the cases per capita per day, smoothed via a 7-day rolling average, for California (in black) as compared to its four core clusters. Notably, not all the state cases are included, as some counties are removed from core clusters or are contained in out-of-state clusters. Thanks to the core cluster analysis, we can identify areas within California where the disease activity seems higher than elsewhere. For instance, clusters 18 and 42 had higher than state-average cases per capita, so it stands to reason to localize non-pharmaceutical interventions or preventive treatments in those zones. In Table 1 we list the counties contained in clusters 18 and 42 and their populations. Cluster 42 corresponds to the urban area of Los Angeles, and it saw an early rise of cases compared to the counties in nearby cluster 18. Instead, cluster 18 shows a larger incidence number at later stages of the pandemic. This pattern makes sense: the LA area has a much higher population density (so rapid spread is very likely), and it was first detected as case positive by our analysis in Figure 7 very early, in the week of March 1st, 2020. Cluster 18 has a population density roughly a tenth of that of the LA area, and the initial spread happened a week later. It contains 3 Californian congressional districts (San Bernardino, Riverside and Imperial), with the first two districts known to go back and forth between Republican and Democratic political representation. Yuma County, in Arizona, is similarly Republican in presidential voting, with some Democratic local representatives.
Based on known correlations between willingness to adopt NPI measures and the political leaning of individuals in the USA (see [34]), our analysis seems, in a speculative way, to support the same argument: the higher than average incidence in cluster 18, although it is much more sparsely populated than cluster 42, may be due to individual behaviour, i.e., to a higher presence of individuals disinclined to adopt NPI measures.
Cluster 18 counties | Population | Cluster 42 counties | Population |
Yuma County | 207,829 | Los Angeles County | 10,098,052 |
Imperial County | 180,216 | Ventura County | 848,112 |
Riverside County | 2,383,286 | | |
San Bernardino County | 2,135,413 | | |
Population density | 132.77/sq mi | Population density | 1,854.96/sq mi |
We performed a similar analysis for the state of Florida, which comprises two enclosed clusters, as shown in Figure 12 (while counties in the north are joined with the neighboring states). From Figure 12(a), we see that cluster 31 is disproportionately responsible for the spread during the 3 waves that Florida experienced in 2020, as compared to cluster 2. This effect could be explained by the higher population density in cluster 31, 780 people per square mile, as it also encloses Miami. Instead, cluster 2 has an average density of 318 people per square mile. Moreover, cluster 31 contains the counties of Monroe, Miami-Dade and Broward, the 3 most tourism-intensive counties, with some of the nicest weather, which most likely provided ample opportunities for individuals to lower their risk perception of getting infected with COVID-19 [35].
To illustrate Case ⅲ) in this section, we took the example of the state of New Mexico: it is almost entirely covered by the much larger core cluster 38, which also includes a sizable part of Texas and a few counties in Colorado, as shown in Figure 13(b). In analogy to the analysis for Case ⅱ) before, in Figure 13(a) we show the cases per capita in the parts of cluster 38 that overlap with the states of New Mexico and Texas, versus the cases per capita in the whole cluster (in green). Interestingly, we see a rather different behavior in the two state portions of the cluster, due to the very different policies applied in the two states. Nevertheless, the interconnection among counties within the cluster, highlighted by our clustering algorithm, implies that disease could propagate from one side to the other far more easily than could be expected by simply looking at the state boundaries. Here the analysis implies that some coordination and collaboration on non-pharmaceutical interventions and preventive policies against the spread of disease would be beneficial to New Mexico, and it would help the state of Texas as well. Since Texas has had very little control over its disease spread during the pandemic, its policy has directly and negatively influenced New Mexico.
Our algorithm is built generically: all that is required is a region that can be subdivided into N sub-regions (such as counties or public health regions) and mobility data among the sub-regions. To illustrate its versatility, we applied it to other regions of the world, and we report here the case of the province of Ontario in Canada. For the mobility information, in the case of Ontario, we switched from proprietary data to publicly available data. We used a 2016 work mobility survey as a baseline for worker mobility and then adjusted it based on Google mobility index changes from baseline to obtain an interaction matrix. Moreover, we looked at Ontario as a collection of 34 public health regions, which are well-defined geographically.
To obtain a daily contact rate between health units in the province of Ontario across 23 months (from February 2020 to December 2021), we combined data from two sources: mobility reports by Google [20] and commuting flow data by the Government of Canada [21]. Both data sources are publicly available. The commuter flow data reveals that 25.16% of the employed population in Ontario works outside the census division where they live. This data source also shows that 11 census divisions out of 49 have more than 40% of their workforce commuting daily outside the census division where they reside [21]. By combining this data set with the Google Mobility reports, we can estimate the daily contact rates among the network nodes in Ontario.
During the COVID-19 pandemic, Google Mobility reports captured changes in movement over time compared to baseline (pre-lockdown) activity in different categories, such as retail/recreation, transit stations, and workplaces [20]. Google's index data have been used in previous analyses [13,36,37]. Since Google split the mobility data into 51 regions in Ontario, corresponding to local municipalities, we had to combine several regions along their borders to obtain mobility data for the 34 health units in this work (see [38] for more details). The same borders were considered to calculate the commuter rate for health units, as commuter data was reported at the census division level. Let mij denote the entries of the commuter flow matrix between health units in Ontario, where i is the place of residence (POR) and j is the place of work (POW). Then the contact rate between health unit i (POR) and health unit j (POW) on day t is calculated as:
$$\mathrm{contact}_{ij}(t) = \frac{m_{ij}}{\mathrm{Emp}_i}\, g^{m}_{i}(t), \quad \text{with } m = \text{Work},$$
where Emp_i is the employed population size of health unit i and g^m_i(t) represents the fluctuation percentage in the Google Work Index on day t compared to the baseline Google Index.
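A minimal sketch of this computation for a single day is given below. It reads the formula above as the commuter flow from i to j divided by the employed population of i, scaled by the Google Work Index factor of unit i; this reading, the function name `contact_rates`, the treatment of the index as a multiplicative factor relative to baseline, and all numbers are assumptions made for illustration.

```python
import numpy as np

def contact_rates(commuters, employed, work_index):
    """Contact-rate matrix for one day.

    commuters[i, j]: workers living in unit i and working in unit j (m_ij)
    employed[i]:     employed population of unit i (Emp_i)
    work_index[i]:   Google Work Index factor of unit i on that day,
                     relative to baseline (assumed multiplicative)
    """
    commuters = np.asarray(commuters, dtype=float)
    employed = np.asarray(employed, dtype=float)
    work_index = np.asarray(work_index, dtype=float)
    # Row i is divided by Emp_i and scaled by the index of unit i.
    return commuters / employed[:, None] * work_index[:, None]

# Two hypothetical health units.
m = [[80, 20],
     [10, 90]]
emp = [100, 100]
g = [0.5, 1.0]   # unit 0 at half of its baseline workplace activity
print(contact_rates(m, emp, g))
```

Repeating the call for each day t with that day's index values yields the time-dependent interaction matrix used as input to the clustering.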
The result of the clustering algorithm is shown in Figure 14, where we focus on the southern region of Ontario, where most of the population is concentrated. This part is subdivided into 6 clusters (while two more are found in the northern part). Interestingly, the cluster structure roughly coincides with the administrative division of health units, which comprises 5 regions: East, Central East, Central West, South West and Toronto [39]. While East matches the definition of cluster 1 (blue), which comprises Ottawa, the other regions are rearranged by our algorithm. Notably, Toronto is merged with Central East and part of Central West, following the intense commuter flow to the main city of the province from nearby regions. The other clusters clearly center on Niagara (8), Waterloo (4) and Windsor (5). Hence, our cluster definition seems to better represent the demographic character of the province of Ontario as compared to the administrative division.
In this paper we formulated and studied a novel clustering algorithm based on mobility data among small geographical units, such as counties. We applied it to the counties of the USA across the first 6 months of the 2020 COVID-19 pandemic, and we introduced the notion of core clusters. They are defined by counties that consistently remain in the same cluster over a long period, hence being insensitive to changes in mobility due to the implementation of local non-pharmaceutical measures. The core clusters provide comparable geographical units characterized by a sub-population that is well mixed by internal mobility. Hence, they offer an ideal basis to study the diffusion of an infectious disease within a large region. This new approach allowed us to capture the spatio-temporal spread of COVID-19 across the USA and to highlight features that are washed out in a state-based analysis. For instance, we found that the initial spread of COVID-19 in 2020 proceeded from the north of California to a series of hot zones that feature an airport hub. This result further confirms the relevance of air passenger transportation, as first highlighted in an earlier study using the nine census divisions as geographical units [15]. Furthermore, an analysis of the incidence number in core clusters showed that some states would need a higher granularity of localized measures to dampen disease spread, while others showed the need for cooperation in dampening measures between neighboring states. The core clusters we identified in the USA could also be used as an efficient basis to predict the spread of an infectious disease across the country by use of a diffusion model, for instance the eRG [40].
The clustering of Ontario is a simple example of the capabilities of our algorithm. Given any geographical area, its population distribution, and a form of mobility data, the algorithm is able to cluster virtually any region in order to discover well-connected sub-regions. This paper focuses on the usefulness of the algorithm for the COVID-19 pandemic; however, no aspect of the algorithm is limited to this pandemic. As such, the algorithm can be applied to a variety of problems that rely on population mobility. Because the algorithm is modular, a user can add further terms to the loss function and thus tailor the algorithm to a specific problem.
From a policy perspective, the versatility of applying our algorithm at any granularity level in a given large geographic area is of interest for governing bodies, who are concerned about the differing sub-populations living and working there, and their connections with their neighbors. In the last section we pointed out specific regions with a disproportionate number of cases per capita relative to other regions, all within the same geographical/census area. Using this information, areas which have experienced higher than state-level case numbers per capita during the pandemic can be expected to have a higher probability of transmission of a pathogen such as SARS-CoV-2, due to their internal well-mixing. Thus, they can be a target of resource allocation, in order to reduce the impact of a future pandemic or epidemic. For instance, identifying regions where two or more states could coordinate their public policy (as is the case with New Mexico and Texas) can also translate into a resource allocation marker, such as deployment of PPE and medical equipment, lab and test supplies, data analysis expertise and capacity, and later vaccine supplies. On the other hand, cooperation in regional public policy would equally help, where two adjacent states with well-mixed interstate populations could try to deploy resources and preventive measures in a more coordinated fashion. Last but not least, we have exemplified here several uses of our study; however, each aspect of our analyses can be developed in more depth, depending on user needs.
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
All sources of funding for the study are disclosed below. M.-G. Cojocaru acknowledges support for the work from the Natural Sciences and Engineering Research Council (NSERC), via a Discovery Grant 400684 and an Alliance Grant Option Ⅱ (providing partial funding support for D. Lyver), and a Mathematics for Public Health Fields Institute grant (providing support for Z. Mohammadi). G. Cacciapaglia and C. Cot acknowledge partial support from the MITI project "Événements rares" of CNRS, project SpikeRG. Cuebiq mobility data was purchased and made available to the authors by Sanofi.
The authors declare there is no conflict of interest.
To assess how well the algorithm is clustering, a variation on the stochastic block model (SBM) is used. The SBM provides an ideal basis to test the accuracy of clustering algorithms [35]. An SBM forms a graph using the assumption that each node within a network belongs to a community, connects to other nodes within its own community with probability p, and connects to nodes outside its own community with probability q [35]. Once this graph is formed, the goal is to recover the communities based on the connections between nodes, in a process called community detection [35]. Several types of recovery are possible based on the values of p and q: exact recovery, partial recovery, and no recovery [35].
In order to achieve exact recovery using an SBM, the following inequality must hold [35]:
$$\big(n(p-q)\big)^2 > 2\,n(p+q), \tag{A1}$$
where n is the number of nodes in the graph, and p and q are as defined previously.
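The inequality in Eq (A1), read as (n(p−q))² > 2n(p+q), is straightforward to check numerically. The helper below is a minimal sketch; the name `exact_recovery_possible` and the sample values are chosen here for illustration.

```python
def exact_recovery_possible(n, p, q):
    """True when (n * (p - q))**2 > 2 * n * (p + q), i.e., Eq (A1) holds."""
    return (n * (p - q)) ** 2 > 2 * n * (p + q)

# n = 500 nodes with p + q = 1, as in the tests described in this appendix.
print(exact_recovery_possible(500, 0.60, 0.40))  # well separated communities
print(exact_recovery_possible(500, 0.51, 0.49))  # p and q very close
```

With n = 500 the first pair satisfies the condition while the second does not, which illustrates how the bound tightens as q approaches p.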
In order to use the SBM to test the algorithm, an example model was created such that the algorithm could be rigorously tested for performance. In order to create this test graph, three variables needed to be defined, p, q and n. The definition of these variables remains the same as previously mentioned, but these variables will change throughout testing in order to better understand the algorithm's capabilities [35].
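A test graph of this kind can be generated as in the following sketch. It assumes equal-sized communities and an undirected graph without self-loops; these choices, together with the function name `sample_sbm` and the `seed` parameter, are assumptions made for illustration and may differ from the variation of the SBM used in the actual tests.

```python
import numpy as np

def sample_sbm(n, k, p, q, seed=0):
    """Sample an undirected SBM graph with n nodes and k communities.

    Nodes in the same community are linked with probability p,
    nodes in different communities with probability q.
    Returns the 0/1 adjacency matrix and the community labels.
    """
    rng = np.random.default_rng(seed)
    labels = np.arange(n) % k                      # (near-)equal community sizes
    same = labels[:, None] == labels[None, :]      # True for same-community pairs
    prob = np.where(same, p, q)                    # link probability per pair
    upper = np.triu(rng.random((n, n)) < prob, k=1)  # one draw per unordered pair
    adj = upper | upper.T                          # symmetrize, no self-loops
    return adj.astype(int), labels

adj, labels = sample_sbm(n=500, k=5, p=0.7, q=0.3)
print(adj.shape, adj.sum() // 2, "edges")
```

Varying p, q and n in this generator reproduces the kind of parameter sweep described in the text, with the known `labels` serving as ground truth for community detection.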
This investigation helps to establish optimal ranges for the weights of the loss function in Eq (2.3). In order to determine the optimal values for αmobility and αPop, a grid search was performed. From the grid search, heat maps were created in order to view the performance of the algorithm.
The performance in this case is defined by how close q can be to p such that the algorithm still performs exact recovery. The number of clusters was arbitrarily chosen to be 5 and the number of nodes in the graph to be 500; however, a range of values was analyzed in order to understand the impact of changing each value. The initial test was done by scanning over all possible αmobility and αPop, where αPop+αmobility=1, with a step size of 0.1. Each combination was tested on a set of p and q values such that p+q=1 and p>q. The initial test is visualized in Figure A1.
In Figure A1(a), as the αmobility value increases, q and p can become closer before the algorithm is unable to exactly extract the communities. Continuing this process and reducing the step size allowed for a "zoomed-in" grid search for the region of best-fitting parameters. In addition, the step size between the p and q values needed to shrink in order to find the boundary where the algorithm was unable to reconstruct the clusters for certain combinations of α's.
We continually saw an improvement of the boundary as αmobility approaches 1. The heat map for the final test run was used to determine the optimal value of the α's which can be seen in Figure A1(b).
This results in an optimal range of αmobility∈[0.97,0.99] and αPop∈[0.01,0.03], which contains the estimated range that was assumed to provide the most accurate clustering of the USA during previous testing.
Another interesting finding that arose during this testing was the ability of the algorithm to cluster past the bound of the stochastic block model (SBM). Since the number of nodes was 500 and p+q=1, the bound can be written as:
$$\big(500(p-q)\big)^2 > 2\,(500 \cdot 1). \tag{A2}$$
Since n=500 and p+q=1, rearranging and solving for p gives the bound p>0.06+q, or p>0.53. However, to obtain the results in Figure 5, the value of p is 0.501. This was initially a cause for concern; however, it is acceptable because of the additional information that is provided to the system based on the population. This additional information is enough to allow the algorithm to detect communities past the theoretical bound. Since the number of nodes and the number of clusters were both arbitrarily chosen, the effects of altering these values on the algorithm's ability to extract the exact communities were examined.
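For completeness, the intermediate steps behind the bound quoted above can be written out as follows, using only n=500 and p+q=1:

```latex
\begin{align*}
\big(n(p-q)\big)^2 &> 2\,n\,(p+q) = 2n
  && \text{(since } p+q=1\text{)} \\
(p-q)^2 &> \frac{2}{n} = \frac{2}{500} = 0.004 \\
p-q &> \sqrt{0.004} \approx 0.0632
  && \text{(since } p>q\text{)} \\
2p-1 &> 0.0632
  && \text{(substituting } q = 1-p\text{)} \\
p &> 0.5316 \approx 0.53 .
\end{align*}
```

The same steps show that, for general n with p+q=1, the threshold is p > (1 + sqrt(2/n))/2, so the bound approaches 0.5 as the number of nodes grows.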
First, the effect of changing the number of clusters was investigated. Initially, it was expected that adjusting the number of communities would have minimal effects. This is true in terms of the most accurate values for the α terms, which remain αPop∈[0.01,0.05] and αmobility∈[0.95,0.99]. However, the accuracy of clustering is affected when the number of communities increases.
Next, the effect of changing the number of nodes while keeping the number of communities constant was analyzed, followed by the effect of both the number of communities and the number of nodes growing. This resulted once again in no change in terms of the optimal alpha values, as is demonstrated in Figure A2.
One may also note that, even for the optimal α parameters, the closest that p and q can become before exact clustering can no longer be achieved is 0.7 and 0.3, respectively. A range of node values was analyzed up to n=3500, but all resulted in a similar trend. The reason for stopping at n=3500 is the foreseen application of this algorithm: our goal is to ensure that the algorithm is able to cluster the USA accurately, and the USA contains approximately 3100 counties.
We also attempted to analyze the results of increasing both the number of communities and the number of nodes, in analogy with the problem we aim to solve in the USA. However, this yielded no notable results.
The reason that some of these experiments yielded no notable results is that increasing the number of communities and the number of nodes increases the complexity of the problem. In order to understand the increase in complexity, it is critical to note that what is being tested is whether a node has been placed in the correct community or in the wrong community. This binary point of view allows the problem to be approached in a slightly different way. The number of connections between nodes in the same community grows according to the following formula:
$$p\binom{n/k}{2}, \tag{A3}$$
where n represents the number of nodes, k the number of communities, and p the probability that a connection exists between two nodes in the same community. The number of connections between a community's nodes and nodes outside that community grows according to the following formula:
$$q\left(n-\frac{n}{k}\right)n, \tag{A4}$$
where n and k are as in Eq (A3) and q represents the probability that a connection exists between two nodes that are not in the same community.
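To make the growth of the two quantities concrete, the sketch below evaluates them for a few values of k. It assumes that Eq (A3) denotes p times the binomial coefficient "n/k choose 2" and that Eq (A4) denotes q(n − n/k)n; the function names and the chosen values of n, p and q are illustrative only.

```python
from math import comb

def intra_community_connections(n, k, p):
    """Eq (A3), read as p * C(n/k, 2): connections inside one community."""
    return p * comb(n // k, 2)

def inter_community_connections(n, k, q):
    """Eq (A4), read as q * (n - n/k) * n: connections leaving communities."""
    return q * (n - n / k) * n

n, p, q = 500, 0.7, 0.3
for k in (5, 10, 25):
    intra = intra_community_connections(n, k, p)
    inter = inter_community_connections(n, k, q)
    print(k, intra, inter, round(inter / intra, 1))
```

As k grows at fixed n, the within-community term shrinks while the between-community term grows, so the informative connections become a smaller fraction of the graph. This is one way to read the loss of performance described in the text.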
Thus, as the number of nodes n and the number of communities k grow, there is a decrease in the performance of the algorithm. However, this decrease in performance is not seen in the clustering for the real-world application to the USA. The reason lies in the nature of human interaction within our collected data, which introduces a geographical component into the algorithm, as the flow metric that is being recorded is the actual travel between regions. This travel is based on the geographical location of each node: for the most part, nodes that are very distant geographically have minimal to no travel between them. This in turn prevents a large increase in the complexity of the problem, and thus prevents the algorithm from failing. Importantly, this is only a hypothesis based on the properties of the real-world data versus the theoretical data generated from the SBM, and it represents a topic that will require further research. An approach to this will likely involve techniques similar to those of the SBM, but with a grid used to define distance within the network. Using this distance, it will be possible to connect only nodes that are in a neighbourhood of one another. Ideally, this distance metric will more closely capture the dynamics that are seen in real-world data.
The accuracy of the model was determined in a similar way to the best α values. Using the fixed values αPop=0.01 and αmobility=0.99, we ran the clustering algorithm on a set of p and q values in order to determine how well the algorithm clusters as it approaches the theoretical boundary. Based on the previous experiments to determine the optimal α values, it was assumed that for k=5 and n=1000 the algorithm would be able to cluster the nodes perfectly even well past the theoretical boundary. This is demonstrated in Figure A3.
The blue line in Figure A3 represents the accuracy of clustering and shows that the algorithm is able to cluster the nodes into the correct communities past the theoretical boundary, which is represented by the orange horizontal line.
When the number of nodes, the number of communities, or both are increased, the problem becomes more difficult and a drop in accuracy is expected. This effect is shown in Figure A4, where the number of communities is increased to k=10.
Figure A4 shows that, when clustering for a larger number of communities, the algorithm is no longer able to cluster past or even up to the theoretical boundary, as the problem has become too challenging. It is likely the accuracy of the algorithm would be affected less if a form of distance was introduced, as mentioned previously. Attempting to increase the number of nodes while keeping the number of communities constant had no notable effect on the accuracy of the algorithm, and thus it behaved very similarly to that of Figure A3.
County FIPS code | County name |
2060 | Bristol Bay Borough |
2164 | Lake and Peninsula Borough |
2231 | Skagway-Yakutat-Angoon Census Area |
2232 | Skagway-Hoonah-Angoon Census Area |
2280 | Wrangell-Petersburg Census Area |
2282 | Yakutat Borough |
2290 | Yukon-Koyukuk Census Area |
6003 | Alpine County |
8005 | Arapahoe County |
12025 | Dade County |
13137 | Habersham County |
15005 | Kalawao County |
28117 | Prentiss County |
29137 | Monroe County |
30113 | Yellowstone National Park |
34027 | Morris County |
38045 | LaMoure County |
39139 | Richland County |
45045 | Georgetown County |
46113 | Shannon County |
48027 | Bell County |
48189 | Hale County |
51019 | Bedford County |
51081 | Greensville County |
51095 | James City County |
51515 | Bedford city |
51530 | Buena Vista city |
51540 | Charlottesville city |
51560 | Clifton Forge city |
51580 | Covington city |
51595 | Emporia city |
51600 | Fairfax city |
51660 | Harrisonburg city |
51678 | Lexington city |
51683 | Manassas city |
51685 | Manassas Park city |
51690 | Martinsville city |
51720 | Norton city |
51770 | Roanoke city |
51775 | Salem city |
51780 | South Boston city |
51790 | Staunton city |
51820 | Waynesboro city |
51840 | Winchester city |
[1] |
P. Boccaletto, M. A. Machnicka, E. Purta, P. Piatkowski, B. Baginski, T. K. Wirecki, et al., MODOMICS: A database of RNA modification pathways. 2017 update, Nucleic Acids Res., 46 (2018), D303-D307. https://doi.org/10.1093/nar/gkx1030 doi: 10.1093/nar/gkx1030
![]() |
[2] |
J. Song, C. Yi, Chemical modifications to RNA: A new layer of gene expression regulation, ACS Chem. Biol., 12 (2017), 316-325. https://doi.org/10.1021/acschembio.6b00960 doi: 10.1021/acschembio.6b00960
![]() |
[3] |
F. F. Davis, F. W. Allen, Ribonucleic acids from yeast which contain a fifth nucleotide, J. Biol. Chem., 227 (1957), 907-915. https://doi.org/10.1016/s0021-9258(18)70770-9 doi: 10.1016/s0021-9258(18)70770-9
![]() |
[4] |
W. E. Cohn, Pseudouridine, a carbon-carbon linked ribonucleoside in ribonucleic acids: Isolation, structure, and chemical characteristics, J. Biol. Chem., 235 (1960), 1488-1498. https://doi.org/10.1002/jbmte.390020410 doi: 10.1002/jbmte.390020410
![]() |
[5] |
T. Fujiwara, H. Harigae, Molecular pathophysiology and genetic mutations in congenital sideroblastic anemia, Free Radical Biol. Med., 133 (2019), 179-185. https://doi.org/10.1016/j.freeradbiomed.2018.08.008 doi: 10.1016/j.freeradbiomed.2018.08.008
![]() |
[6] |
N. Guzzi, M. Ciesla, P. C. T. Ngoc, S. Lang, S. Arora, M. Dimitriou, et al., Pseudouridylation of tRNA-derived fragments steers translational control in stem cells, Cell, 173 (2018), 1204-1216. https://doi.org/10.1016/j.cell.2018.03.008 doi: 10.1016/j.cell.2018.03.008
![]() |
[7] |
J. Karijolich, Y. T. Yu, Converting nonsense codons into sense codons by targeted pseudouridylation, Nature, 474 (2011), 395-398. https://doi.org/10.1038/nature10165 doi: 10.1038/nature10165
![]() |
[8] |
R. W. Holley, G. A. Everett, J. T. Madison, A. Zamir, Nucleotide sequences in the yeast alanine transfer ribonucleic acid, J. Biol. Chem., 240 (1965), 2122-2128. https://doi.org/10.1016/s0021-9258(18)97435-1 doi: 10.1016/s0021-9258(18)97435-1
![]() |
[9] |
C. Y. Gradeen, D. M.Billay, S. C. Chan, Analysis of bumetanide in human urine by high-performance liquid chromatography with fluorescence detection and gas chromatographyl/mass spectrometry, J. Anal. Toxicol., 14 (1990), 123-126. https://doi.org/10.1093/jat/14.2.123 doi: 10.1093/jat/14.2.123
![]() |
[10] |
A. Basak, C. C. Query, A pseudouridine residue in the spliceosome core is part of the filamentous growth program in yeast, Cell Rep., 8 (2014), 966-973. https://doi.org/10.1016/j.celrep.2014.07.004 doi: 10.1016/j.celrep.2014.07.004
![]() |
[11] |
T. M. Carlile, M. F. Rojas-Duran, B. Zinshteyn, H. Shin, K. M. Bartoli, W. V. Gilbert, Pseudouridine profiling reveals regulated mRNA pseudouridylation in yeast and human cells, Nature, 515 (2014), 143-146. https://doi.org/10.1038/nature13802 doi: 10.1038/nature13802
![]() |
[12] |
S. Schwartz, D. A. Bernstein, M. R. Mumbach, M. Jovanovic, R. H. Herbst, B. X. Leon-Ricardo, et al., Transcriptome-wide mapping reveals widespread dynamic-regulated pseudouridylation of ncRNA and mRNA, Cell, 159 (2014), 148-162. https://doi.org/10.1016/j.cell.2014.08.028 doi: 10.1016/j.cell.2014.08.028
![]() |
[13] |
X. Li, P. Zhu, S. Ma, J. Song, J. Bai, F. Sun, et al., Chemical pulldown reveals dynamic pseudouridylation of the mammalian transcriptome, Nat. Chem. Biol., 11 (2015), 592-597. https://doi.org/10.1038/nchembio.1836 doi: 10.1038/nchembio.1836
![]() |
[14] |
B. Panwar, G. P. Raghava, Prediction of uridine modifications in tRNA sequences, BMC Bioinf., 15 (2014), 326. https://doi.org/10.1186/1471-2105-15-326 doi: 10.1186/1471-2105-15-326
![]() |
[15] |
Y. H. Li, G. Zhang, Q. Cui, PPUS: A web server to predict PUS-specific pseudouridine sites, Bioinformatics, 31 (2015), 3362-3364. https://doi.org/10.1093/bioinformatics/btv366 doi: 10.1093/bioinformatics/btv366
![]() |
[16] |
W. Chen, H. Tang, J. Ye, H. Lin, K. C. Chou, iRNA-PseU: Identifying RNA pseudouridine sites, Mol. Ther. Nucleic Acids, 5 (2016), e332. https://doi.org/10.1038/mtna.2016.37 doi: 10.1038/mtna.2016.37
![]() |
[17] |
J. He, T. Fang, Z. Zhang, B. Huang, X. Zhu, Y. Xiong, PseUI: Pseudouridine sites identification based on RNA sequence information, BMC Bioinf., 19 (2018), 306. https://doi.org/10.1186/s12859-018-2321-0 doi: 10.1186/s12859-018-2321-0
![]() |
[18] |
M. Tahir, H. Tayara, K. T. Chong, iPseU-CNN: Identifying RNA pseudouridine sites using convolutional neural networks, Mol. Ther. Nucleic Acids, 16 (2019), 463-470. https://doi.org/10.1016/j.omtn.2019.03.010 doi: 10.1016/j.omtn.2019.03.010
![]() |
[19] |
K. Liu, W. Chen, H. Lin, XG-PseU: An eXtreme Gradient Boosting based method for identifying pseudouridine sites, Mol. Genet. Genomics, 295 (2020), 13-21. https://doi.org/10.1007/s00438-019-01600-9 doi: 10.1007/s00438-019-01600-9
![]() |
[20] |
Z. Lv, J. Zhang, H. Ding, Q. Zou, RF-PseU: A random forest predictor for RNA pseudouridine sites, Front. Bioeng. Biotechnol., 8 (2020), 134. https://doi.org/10.3389/fbioe.2020.00134 doi: 10.3389/fbioe.2020.00134
![]() |
[21] |
S. M. Khan, F. He, D. Wang, Y. Chen, D. Xu, Mu-pseudeep: A deep learning method for prediction of pseudouridine sites, Comput. Struct. Biotechnol. J., 18 (2020), 1877-1883. https://doi.org/10.1016/j.csbj.2020.07.010 doi: 10.1016/j.csbj.2020.07.010
![]() |
[22] |
F. Li, X. Guo, P. Jin, J. Chen, D. Xiang, J. Song, Porpoise: A new approach for accurate prediction of RNA pseudouridine sites, Briefings Bioinf., 22 (2021), bbab245. https://doi.org/10.1093/bib/bbab245 doi: 10.1093/bib/bbab245
![]() |
[23] |
Y. Q. Qian, H. Meng, W. Z. Lu, Z. J. Liao, Y. J. Ding, H. J. Wu, Identification of DNA-binding proteins via hypergraph based laplacian support vector machine, Curr. Bioinf., 17 (2022), 108-117. https://doi.org/10.2174/1574893616666210806091922 doi: 10.2174/1574893616666210806091922
![]() |
[24] |
S. Naseer, W. Hussain, Y. D. Khan, N. Rasool, NPalmitoylDeep-PseAAC: A predictor of N-palmitoylation sites in proteins using deep representations of proteins and PseAAC via modified 5-Steps rule, Curr. Bioinf., 16 (2021), 294-305. https://doi.org/10.2174/1574893615999200605142828 doi: 10.2174/1574893615999200605142828
![]() |
[25] |
S. W. Sun, L. Xu, Q. Zou, G. H. Wang, BP4RNAseq: A babysitter package for retrospective and newly generated RNA-seq data analyses using both alignment-based and alignment-free quantification methods, Bioinformatics, 37 (2021), 1319-1321. https://doi.org/10.1093/bioinformatics/btaa832 doi: 10.1093/bioinformatics/btaa832
![]() |
[26] |
L. Zhang, Z. Huang, L. Kong, CSBPI_Site: Multi-information sources of features to RNA binding sites prediction, Curr. Bioinf., 16 (2021), 691-699. https://doi.org/10.2174/1574893615666210108093950 doi: 10.2174/1574893615666210108093950
![]() |
[27] |
Z. Zhang, F. Cui, W. Su, L. Dou, A. Xu, C. Cao, Q. Zou, webSCST: An interactive web application for single-cell RNA-sequencing data and spatial transcriptomic data integration, Bioinformatics, 38 (2022), 3488-3489. https://doi.org/ 10.1093/bioinformatics/btac350 doi: 10.1093/bioinformatics/btac350
![]() |
[28] |
X. Wang, S. Wang, H. Fu, X. Ruan, X. Tang, DeepFusion-RBP: Using Deep Learning to Fuse Multiple Features to Identify RNA-binding Protein Sequences, Curr. Bioinf., 16 (2021), 1089-1100. https://doi.org/ 10.2174/1574893616666210618145121 doi: 10.2174/1574893616666210618145121
![]() |
[29] |
W. Chen, H. Ding, X. Zhou, H. Lin, K. C. Chou, iRNA(m6A)-PseDNC: Identifying N(6)-methyladenosine sites using pseudo dinucleotide composition, Anal. Biochem., 561 (2018), 59-65. https://doi.org/10.1016/j.ab.2018.09.002 doi: 10.1016/j.ab.2018.09.002
![]() |
[30] |
L. Wei, M. Liao, Y. Gao, R. Ji, Z. He, Q. Zou, Improved and promising identification of human microRNAs by incorporating a high-quality negative set, IEEE/ACM Trans. Comput. Biol. Bioinf., 11 (2014), 192-201. https://doi.org/10.1109/TCBB.2013.146 doi: 10.1109/TCBB.2013.146
![]() |
[31] |
B. Liu, X. Gao, H. Zhang, BioSeq-Analysis2.0: an updated platform for analyzing DNA, RNA and protein sequences at sequence level and residue level based on machine learning approaches, Nucleic Acids Res., 47 (2019), e127. https://doi.org/10.1093/nar/gkz740 doi: 10.1093/nar/gkz740
![]() |
[32] |
W. Chen, X. Zhang, J. Brooker, H. Lin, L. Zhang, K. C. Chou, PseKNC-General: A cross-platform package for generating various modes of pseudo nucleotide compositions, Bioinformatics, 31 (2015), 119-120. https://doi.org/10.1093/bioinformatics/btu602 doi: 10.1093/bioinformatics/btu602
![]() |
[33] |
H. Yang, H. Lv, H. Ding, W. Chen, H. Lin, iRNA-2OM: A sequence-based predictor for identifying 2'-O-Methylation sites in Homo sapiens, J. Comput. Biol., 25 (2018), 1266-1277.https://doi.org/10.1089/cmb.2018.0004 doi: 10.1089/cmb.2018.0004
![]() |
[34] |
B. Liu, BioSeq-Analysis: A platform for DNA, RNA and protein sequence analysis based on machine learning approaches, Briefings Bioinf., 20 (2019), 1280-1294. https://doi.org/10.1093/bib/bbx165 doi: 10.1093/bib/bbx165
![]() |
[35] |
Y. Hu, T. Zhao, N. Zhang, Y. Zhang, L. Cheng, A review of recent advances and research on drug target identification methods, Curr. Drug Metab., 20 (2019), 209-216. https://doi.org/10.2174/1389200219666180925091851 doi: 10.2174/1389200219666180925091851
![]() |
[36] | A. S. Nair, S. P. Sreenadhan, A coding measure scheme employing electron-ion interaction pseudopotential (EⅡP), Bioinformation, 1 (2006), 197-202. |
[37] |
H. Peng, F. Long, C. Ding, Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy, IEEE Trans. Pattern Anal. Mach. Intell., 27 (2005), 1226-1238. https://doi.org/10.1109/TPAMI.2005.159 doi: 10.1109/TPAMI.2005.159
![]() |
[38] |
Y. Tian, Z. Qi, Review on: Twin support vector machines, Ann. Data Sci., 1 (2014), 253-277. https://doi.org/10.1007/s40745-014-0018-4 doi: 10.1007/s40745-014-0018-4
![]() |
[39] |
L. Cheng, J. Sun, W. Xu, L. Dong, Y. Hu, M. Zhou, OAHG: An integrated resource for annotating human genes with multi-level ontologies, Sci. Rep., 6 (2016), 1-9. https://doi.org/10.1038/srep34820 doi: 10.1038/srep34820
![]() |
[40] |
L. Y. Wei, S. X. Wan, J. S. Guo, K. K. L. Wong, A novel hierarchical selective ensemble classifier with bioinformatics application, Artif. Intell. Med., 83 (2017), 82-90. https://doi.org/10.1016/j.artmed.2017.02.005 doi: 10.1016/j.artmed.2017.02.005
![]() |
[41] |
B. Liu, C. C. Li, K. Yan, DeepSVM-fold: protein fold recognition by combining support vector machines and pairwise sequence similarity scores generated by deep learning networks, Briefings Bioinf., 21 (2020), 1733-1741. https://doi.org/10.1093/bib/bbz098 doi: 10.1093/bib/bbz098
![]() |
[42] |
D. Mrozek, P. Gosk, B. Małysiak-Mrozek, Scaling Ab initio predictions of 3D protein structures in microsoft azure cloud, J. Grid Comput., 13 (2015), 561-585. https://doi.org/10.1007/s10723-015-9353-8 doi: 10.1007/s10723-015-9353-8
![]() |
[43] |
R. Cao, J. Cheng, Protein single-model quality assessment by feature-based probability density functions, Sci. Rep., 6 (2016), 23990. https://doi.org/10.1038/srep23990 doi: 10.1038/srep23990
![]() |
[44] |
W. Chen, H. Yang, P. Feng, H. Ding, H. Lin, iDNA4mC: Identifying DNA N-4-methylcytosine sites based on nucleotide chemical properties, Bioinformatics, 33 (2017), 3518-3523. https://doi.org/ 10.1093/bioinformatics/btx479 doi: 10.1093/bioinformatics/btx479
![]() |
[45] |
W. Chen, H. Lv, F. Nie, H. Lin, i6mA-Pred: identifying DNA N6-methyladenine sites in the rice genome, Bioinformatics, 35 (2019), 2796-2800. https://doi.org/10.1093/bioinformatics/btz015 doi: 10.1093/bioinformatics/btz015
![]() |
[46] |
G. Pan, J. Tang, F. Guo, Analysis of co-associated transcription factors via ordered adjacency differences on motif distribution, Sci. Rep., 7 (2017), 43597. https://doi.org/10.1038/srep43597 doi: 10.1038/srep43597
![]() |
[47] |
W. He, C. Jia, Q. Zou, 4mCPred: machine learning methods for DNA N4-methylcytosine sites prediction, Bioinformatics, 35 (2019), 593-601. https://doi.org/10.1093/bioinformatics/bty668 doi: 10.1093/bioinformatics/bty668
![]() |
[48] |
L. Jiang, Y. Ding, J. Tang, F. Guo, MDA-SKF: Similarity kernel fusion for accurately discovering miRNA-Disease association, Front. Genet., 9 (2018), 618. https://doi.org/10.3389/fgene.2018.00618 doi: 10.3389/fgene.2018.00618
![]() |
[49] |
Y. Xiong, Q. Wang, J. Yang, X. Zhu, D. Q. Wei, PredT4SE-Stack: Prediction of bacterial type Ⅳ secreted effectors from protein sequences using a stacked ensemble method, Front. Microbiol., 9 (2018), 2571. https://doi.org/10.3389/fmicb.2018.02571 doi: 10.3389/fmicb.2018.02571
![]() |
[50] |
L. Yu, J. Zhao, L. Gao, Predicting potential drugs for breast cancer based on miRNA and tissue specificity, Int. J. Biol. Sci., 14 (2018), 971-982. https://doi.org/10.7150/ijbs.23350 doi: 10.7150/ijbs.23350
![]() |
[51] |
M. Zhang, Y. Xu, L. Li, Z. Liu, X. Yang, D. J. Yu, Accurate RNA 5-methylcytosine site prediction based on heuristic physical-chemical properties reduction and classifier ensemble, Anal. Biochem., 550 (2018), 41-48. https://doi.org/10.1016/j.ab.2018.03.027 doi: 10.1016/j.ab.2018.03.027
![]() |
[52] |
Y. Ding, J. Tang, F. Guo, Identification of drug-side effect association via multiple information integration with centered kernel alignment, Neurocomputing, 325 (2019), 211-224. https://doi.org/10.1016/j.neucom.2018.10.028 doi: 10.1016/j.neucom.2018.10.028
![]() |
[53] |
B. Manavalan, S. Basith, T. H. Shin, L. Wei, G. Lee, Meta-4mCpred: A sequence-based meta-predictor for accurate DNA 4mC site prediction using effective feature representation, Mol. Ther. Nucleic Acids, 16 (2019), 733-744. https://doi.org/10.1016/j.omtn.2019.04.019 doi: 10.1016/j.omtn.2019.04.019
![]() |
[54] |
P. Feng, H. Yang, H. Ding, H. Lin, W. Chen, K. C. Chou, iDNA6mA-PseKNC: Identifying DNA N(6)-methyladenosine sites by incorporating nucleotide physicochemical properties into PseKNC, Genomics, 111 (2019), 96-102. https://doi.org/10.1016/j.ygeno.2018.01.005 doi: 10.1016/j.ygeno.2018.01.005
![]() |
[55] |
L. Kong, L. Zhang, i6mA-DNCP: Computational identification of DNA N(6)-methyladenine sites in the rice genome using optimized dinucleotide-based features, Genes, 10 (2019), 828. https://doi.org/10.3390/genes10100828 doi: 10.3390/genes10100828
![]() |
[56] |
C. C. Li, B. Liu, MotifCNN-fold: protein fold recognition based on fold-specific features extracted by motif-based convolutional neural networks, Briefings Bioinf., 21 (2020), 2133-2141. https://doi.org/10.1093/bib/bbz133 doi: 10.1093/bib/bbz133
![]() |
[57] |
X. Shan, X. Wang, C. D. Li, Y. Chu, Y. Zhang, Y. Xiong, et al., Prediction of CYP450 enzyme-substrate selectivity based on the network-based label space division method, J. Chem. Inf. Model., 59 (2019), 4577-4586. https://doi.org/10.1021/acs.jcim.9b00749 doi: 10.1021/acs.jcim.9b00749
![]() |
[58] |
X. Wang, X. Zhu, M. Ye, Y. Wang, C. D. Li, Y. Xiong, et al., STS-NLSP: A network-based label space partition method for predicting the specificity of membrane transporter substrates using a hybrid feature of structural and semantic similarity, Front. Bioeng. Biotechnol., 7 (2019), 306. https://doi.org/10.3389/fbioe.2019.00306 doi: 10.3389/fbioe.2019.00306
![]() |
[59] |
L. Wei, S. Luan, L. A. E. Nagai, R. Su, Q. Zou, Exploring sequence-based features for the improved prediction of DNA N4-methylcytosine sites in multiple species, Bioinformatics, 35 (2019), 1326-1333. https://doi.org/10.1093/bioinformatics/bty824 doi: 10.1093/bioinformatics/bty824
![]() |
[60] |
L. Wei, R. Su, S. Luan, Z. Liao, B. Manavalan, Q. Zou, et al., Iterative feature representations improve N4-methylcytosine site prediction, Bioinformatics, 35 (2019), 4930-4937. https://doi.org/10.1093/bioinformatics/btz408 doi: 10.1093/bioinformatics/btz408
![]() |
[61] |
L. Xu, G. Liang, C. Liao, G. D. Chen, C. C. Chang, k-Skip-n-Gram-RF: A random forest based method for Alzheimer's disease protein identification, Front. Genet., 10 (2019), 33. https://doi.org/10.3389/fgene.2019.00033 doi: 10.3389/fgene.2019.00033
![]() |
[62] |
L. H. Roland, C. T. Wannige, A deep learning model for predicting DNA N6-methyladenine (6mA) sites in eukaryotes, IEEE Access, 8 (2020), 175535-175545. https://doi.org/10.1109/access.2020.3025990 doi: 10.1109/access.2020.3025990
![]() |
[63] |
Z. Chen, P. Zhao, F. Li, T. T. Marquez-Lago, A. Leier, J. Revote, et al., iLearn: An integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data, Briefings Bioinf., 21 (2020), 1047-1057. https://doi.org/10.1093/bib/bbz041 doi: 10.1093/bib/bbz041
![]() |
[64] | G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, et al., LightGBM: A highly efficient gradient boosting decision tree, in Advances in Neural Information Processing Systems 30 (NIP 2017), 30 (2017), 1-9. |
[65] |
C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn., 20 (1995), 273-297. https://doi.org/ 10.1007/BF00994018 doi: 10.1007/BF00994018
![]() |
[66] |
H. Zhou, H. Wang, Y. Ding, J. Tang, Multivariate information fusion for identifying antifungal peptides with Hilbert-Schmidt independence criterion, Curr. Bioinf., 17 (2022), 89-100. https://doi.org/10.2174/1574893616666210727161003 doi: 10.2174/1574893616666210727161003
![]() |
[67] |
C. Wang, Y. Ju, Q. Zou, C. Lin, DeepAc4C: A convolutional neural network model with hybrid features composed of physico-chemical patterns and distributed representation information for identification of N4 acetylcytidine in mRNA, Bioinformatics, 38 (2022), 52-57. https://doi.org/10.1093/bioinformatics/btab611 doi: 10.1093/bioinformatics/btab611
![]() |
[68] |
X. Guo, W. Zhou, B. Shi, X. Wang, A. Du, Y. Ding, et al., An efficient multiple kernel support vector regression model for assessing dry weight of hemodialysis patients, Curr. Bioinf., 16 (2021), 284-293. https://doi.org/ 10.2174/1574893615999200614172536 doi: 10.2174/1574893615999200614172536
![]() |
[69] |
E. Scornet, Random forests and kernel methods, IEEE Trans. Inf. Theory, 62 (2016), 1485-1500. https://doi.org/10.1109/tit.2016.2514489 doi: 10.1109/tit.2016.2514489
![]() |
[70] |
S. Zhao, Y. Ju, X. Ye, J. Zhang, S. Han, Bioluminescent proteins prediction with voting strategy, Curr. Bioinf., 16 (2021), 240-251. https://doi.org/ 10.2174/1574893615999200601122328 doi: 10.2174/1574893615999200601122328
![]() |
[71] |
M. Niu, Q. Zou, C. Wang, GMNN2CD: Identification of circRNA-disease associations based on variational inference and graph Markov neural networks, Bioinformatics, 38 (2022), 2246-2253. https://doi.org/ 10.1093/bioinformatics/btac079 doi: 10.1093/bioinformatics/btac079
![]() |
[72] |
A. K. Sharma, R. Srivastava, Protein secondary structure prediction using character Bi-gram embedding and Bi-LSTM, Curr. Bioinf., 16 (2021), 333-338. https://doi.org/10.2174/1574893615999200601122840 doi: 10.2174/1574893615999200601122840
![]() |
[73] |
C. Wang, C. Han, Q. Zhao, X. Chen, Circular RNAs and complex diseases: from experimental results to computational models, Briefings Bioinf., 22 (2021), bbab286. https://doi.org/10.1093/bib/bbac357 doi: 10.1093/bib/bbac357
![]() |
[74] |
A. Alim, A. Rafay, I. Naseem, PoGB-pred: Prediction of antifreeze proteins sequences using amino acid composition with feature selection followed by a sequential-based ensemble approach, Curr. Bioinf., 16 (2021), 446-456. https://doi.org/10.2174/1574893615999200707141926 doi: 10.2174/1574893615999200707141926
![]() |
[75] |
Y. Tian, X. Ju, Z. Qi, Y. Shi, Improved twin support vector machine, Sci. China Math., 57 (2013), 417-432. https://doi.org/10.1007/s11425-013-4718-6 doi: 10.1007/s11425-013-4718-6
![]() |
[76] |
Y. Zou, H. Wu, X. Guo, L. Peng, Y. Ding, J. Tang, et al., MK-FSVM-SVDD: A multiple kernel-based fuzzy SVM model for predicting DNA-binding proteins via support vector data description, Curr. Bioinf., 16 (2021), 274-283. https://doi.org/10.2174/1574893615999200607173829 doi: 10.2174/1574893615999200607173829
![]() |
[77] |
Q. Tang, F. Nie, Q. Zhao, W. Chen, A merged molecular representation deep learning method for blood-brain barrier permeability prediction, Briefings Bioinf., 2022 (2022), bbac357. https://doi.org/10.1093/bib/bbac357 doi: 10.1093/bib/bbac357
![]() |
[78] |
F. Li, X. Guo, D. Xiang, M. E. Pitt, A. Bainomugisa, L. J. M. Coin, Computational analysis and prediction of PE_PGRS proteins using machine learning, Comput. Struct. Biotechnol. J., 20 (2022), 662-674. https://doi.org/ 10.1016/j.csbj.2022.01.0192001-0370 doi: 10.1016/j.csbj.2022.01.0192001-0370
![]() |
[79] |
F. Sun, J. Sun, Q. Zhao, A deep learning method for predicting metabolite-disease associations via graph neural network, Briefings Bioinf., 23 (2022), bbac266. https://doi.org/10.1093/bib/bbac266 doi: 10.1093/bib/bbac266
![]() |
[80] |
F. Li, S. Dong, A. Leier, M. Han, X. Guo, J. Xu, et al., Positive-unlabeled learning in bioinformatics and computational biology: A brief review, Briefings Bioinf., 23 (2021), bbab461. https://doi.org/10.1093/bib/bbab461 doi: 10.1093/bib/bbab461
![]() |
[81] |
W. Liu, Y. Jiang, L. Peng, X. Sun, W. Gan, Q. Zhao, et al., Inferring gene regulatory networks using the improved Markov blanket discovery algorithm, Interdiscip. Sci. Comput. Life Sci., 14 (2022), 168-181. https://doi.org/10.1007/s12539-021-00478-9 doi: 10.1007/s12539-021-00478-9
![]() |
Cluster 18 counties | Population | Cluster 42 counties | Population |
Yuma County | 207,829 | Los Angeles County | 10,098,052 |
Imperial County | 180,216 | Ventura County | 848,112 |
Riverside County | 2,383,286 | | |
San Bernardino County | 2,135,413 | | |
Population density | 132.77/sq mi | Population density | 1854.96/sq mi |