Citation: Enrico Capobianco. Born to be Big: data, graphs, and their entangled complexity[J]. Big Data and Information Analytics, 2016, 1(2): 163-169. doi: 10.3934/bdia.2016002
Big Data brings many problems to the attention of physicists, mathematicians, social scientists, biologists, and others [18,1]. A first attempt to categorize these problems into major groups runs into the difficulty of choosing one classification criterion among the many available. Big Data can be differentiated by their types (complex data structures) and by the relationships that can be established between them (complex data patterns). Knowing the distributional laws of Big Data might simplify the task of capturing the essential characteristics (sufficient statistics) of their complexities through newly designed sampling techniques, fast data mining methods and efficient algorithmic processing.
Let us consider three features, or attributes, of data complexity destined to change under the effect of size or bigness: dimensionality, heterogeneity and uncertainty (Figure 1). Let us also assume that Big Data uncertainty calls for almost axiomatic solutions (say, cross-validation), ranging across a myriad of suitably adapted statistical model selection methods. In general, uncertainty can be associated with entropy and, considering a fluctuation theorem for networks, changes in entropy reflect positive changes in resilience against perturbations [6]. In particular, larger average shortest path lengths in resilient networks mitigate node removal effects, inducing slower network disintegration.
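As a rough, purely illustrative sketch of the last point (assuming the networkx library; the random-removal scheme and the Erdős-Rényi test graph are my own choices, not taken from the text), one can track how the giant component shrinks as nodes are deleted; slower shrinkage corresponds to a more resilient network:

```python
import random
import networkx as nx

def giant_component_profile(G, fraction=0.5, seed=0):
    """Remove a fraction of nodes uniformly at random and record how the
    giant (largest connected) component shrinks step by step."""
    rng = random.Random(seed)
    H = G.copy()
    order = list(H.nodes())
    rng.shuffle(order)
    sizes = []
    for v in order[: int(fraction * H.number_of_nodes())]:
        H.remove_node(v)
        sizes.append(max((len(c) for c in nx.connected_components(H)), default=0))
    return sizes

G = nx.erdos_renyi_graph(200, 0.03, seed=1)
giant = max((G.subgraph(c).copy() for c in nx.connected_components(G)), key=len)
print(nx.average_shortest_path_length(giant))   # a rough resilience-related descriptor
print(giant_component_profile(G)[:10])          # disintegration profile under node removal
```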
Among the challenges, the one arising from imbalanced and incomplete data classification is inherently data dependent. In general, given data with a stratified structure, a lack of balance exists when the classes are not equally represented in the data, which might reflect the sparseness of features rather than the class definition itself [5]. For instance, the class of interest could be that of treated patients, presenting few instances compared to a much more largely represented class of control patients. Moving to a larger dataset is the most immediate solution towards the goal of rebalancing the classes. Sampling is another strategy, one that can augment the poorer class (over-sampling) or diminish the richer one (under-sampling); in both scenarios one ends up working with synthetic samples. Importing a penalization strategy into the model is a further possible route, aimed at discounting classification mistakes asymmetrically: with such penalties, the model can rebalance the over- and under-represented classes.
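A minimal sketch of the two routes just mentioned, assuming scikit-learn and a hypothetical treated-versus-control dataset of my own making: random over-sampling of the minority class, and a class-weight penalty placed inside the classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

rng = np.random.default_rng(0)
# Hypothetical stratified dataset: 950 controls (class 0) vs 50 treated patients (class 1).
X = np.vstack([rng.normal(0.0, 1.0, size=(950, 5)), rng.normal(1.0, 1.0, size=(50, 5))])
y = np.array([0] * 950 + [1] * 50)

# Route 1 -- over-sampling: augment the poorer class by resampling it with replacement.
X_min, y_min = X[y == 1], y[y == 1]
X_over, y_over = resample(X_min, y_min, replace=True, n_samples=950, random_state=0)
X_bal, y_bal = np.vstack([X[y == 0], X_over]), np.concatenate([y[y == 0], y_over])
clf_bal = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

# Route 2 -- penalization: class weights discount mistakes on the rare class more heavily.
clf_pen = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
print(clf_bal.score(X, y), clf_pen.score(X, y))
```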
Interestingly, with Big Data one turns from the usual curse of dimensionality (large
$D_\pi(y,X,\Theta) = \pi_1 d_1(y,X,\Theta_1) + \pi_2 d_2(y,X,\Theta_2) + \dots + \pi_k d_k(y,X,\Theta_k)$    (1)
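Read simply as a formula, Eq. (1) mixes component-wise distances $d_j$ through weights $\pi_j$. The toy sketch below only evaluates such a weighted combination; the choice of squared-error components and parameters is hypothetical and not prescribed by the text.

```python
import numpy as np

def mixture_distance(y, X, thetas, weights, distances):
    """D_pi(y, X, Theta) = sum_j pi_j * d_j(y, X, Theta_j), as in Eq. (1)."""
    return sum(w * d(y, X, th) for w, d, th in zip(weights, distances, thetas))

# Toy usage: two hypothetical components, each a mean-squared-error distance
# under its own candidate parameter vector, mixed with weights pi = (0.7, 0.3).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
mse = lambda y, X, theta: float(np.mean((y - X @ theta) ** 2))
print(mixture_distance(y, X,
                       thetas=[np.array([1.0, -2.0, 0.5]), np.zeros(3)],
                       weights=[0.7, 0.3],
                       distances=[mse, mse]))
```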
Several interdependent influences need to be considered for model selection purposes in an attempt to improve analyses and inference: additional dimensions or further stratifications in the data are expected to weaken the ratio of systematic to erratic components in the characterization of the system. Nevertheless, the major problems most likely come from:
● Mismatch between dimensions (typically, sample size and characterizing variables), requiring regularized solutions (see the sketch after this list);
● Existence of inherent but latent stratifications, inducing clusters or communities;
● Influence of stochastic components (only partially observable).
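For the first item above, a minimal sketch of a regularized solution when characterizing variables outnumber samples (a closed-form ridge estimate; the simulated data are an assumption of mine):

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Ridge estimate (X'X + lam*I)^{-1} X'y: a regularized solution
    usable even when variables outnumber samples."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
n, p = 50, 200                        # far more characterizing variables than samples
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 2.0                        # only a few variables carry signal
y = X @ beta + rng.normal(scale=0.5, size=n)
print(ridge_fit(X, y, lam=10.0)[:5])  # estimates of the informative coefficients
```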
Superior precision of estimates and stabilization of variability are expected with Big Data, but the complexity increases too, due to novel classifications and sampling rates, both becoming sub-optimal at the aggregate level. With regard to the latter aspect, an important question is: how likely is it that we operate under under-sampled data conditions with Big Data? And then, how can a correct sampling rate be recovered in such a context, a problem related to the so-called Nyquist rate? An associated problem is aliasing, which arises when a signal is discretely sampled at a rate too low to capture the changes in the signal. To avoid aliasing, the sampling frequency should be at least twice the highest frequency contained in the signal [12,24,4]. With a plethora of measurements coming from heterogeneous digital instruments and sensors, the sampling rate of the corresponding signals necessarily differs at each individual source and is most likely largely undetermined at the aggregate level. The challenge is that of identifying specific data stratifications and segmentations, at the cost of relatively heavy computations.
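A small numerical illustration of the Nyquist condition just stated (the frequencies are arbitrary choices): a 9 Hz tone sampled at 10 Hz, below its Nyquist rate of 18 Hz, is indistinguishable at the sample points from a 1 Hz alias.

```python
import numpy as np

f_signal = 9.0                 # highest frequency present in the signal (Hz)
f_sample = 10.0                # sampling rate below the Nyquist rate of 2 * 9 = 18 Hz
t = np.arange(0, 1, 1 / f_sample)

true_tone = np.sin(2 * np.pi * f_signal * t)
alias_tone = np.sin(2 * np.pi * (f_signal - f_sample) * t)   # an apparent 1 Hz oscillation

# At the sample points the 9 Hz tone and its 1 Hz alias cannot be told apart.
print(np.allclose(true_tone, alias_tone, atol=1e-9))          # True
```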
With Big Data, not only might the likely increase in dimensionality augment the general complexity (spurious correlations, error propagation, etc.) and affect the confidence in models, but in many cases the original data come unstructured or based on a huge number of primitives; in both cases either transformations or reductions are pursued. In general, the patterns at individual and population levels may differ substantially, and thus can hardly be summarized by a few statistics or predicted with confidence.
It is expected that, by integrating information from a variety of sources, the assimilation of the whole data spectrum should not incur a significant loss of information (a good example might again be a subset of patients responding to the same treatment). Therefore, big data in medicine, for instance, would benefit from the ability to recognize disease heterogeneity and to stratify even further in order to be more accurate in the assessment of therapies [2]. In this regard, we might thus speak of a blessing of Big Data.
Finally, Figure 1 implies a key role for sufficient statistics, which are supposed to simplify statistical analysis [20]. A crucial question is: what is a sufficient statistic in Big Data? Can we achieve full information about the data from only a reduced set of it, considering that we only have partial knowledge of the granularity of the original data?
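As a toy reminder of the classical notion (not of the computational trade-offs analyzed in [20]): for a large binary dataset the count of ones is a sufficient statistic, and inference from that single number matches inference from the full data.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.3, size=1_000_000)   # a large binary dataset

theta_full = data.mean()                      # estimate from the full data
count = int(data.sum())                       # sufficient statistic: a single number
theta_reduced = count / data.size             # identical estimate from the reduction
print(theta_full == theta_reduced)            # True
```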
Although data liquidity flows fast and abundantly, Big Data does not represent a self-organized space.
Therefore, hidden networks can depend on a network generator mechanism subject to some level of unknown uncertainty. Mixture mechanisms enable network approximation by a superposition of random networks (Poisson type, for large $N$) multiplied by the hidden variable distribution (HVD) [22,19]. Thus, given a Poisson-like conditional degree distribution $p(k\mid\lambda)$, the observed degree distribution is obtained by mixing over the HVD $\pi(\lambda)$:

$p(k) = \int \pi(\lambda)\, p(k \mid \lambda)\, d\lambda$    (2)
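A numerical sketch of Eq. (2), taking a Gamma distribution as a purely hypothetical choice of HVD $\pi(\lambda)$ and a Poisson conditional $p(k\mid\lambda)$:

```python
import numpy as np
from scipy.stats import gamma, poisson

# Hypothetical HVD pi(lambda): a Gamma distribution (a choice of convenience, not from the text).
lam = np.linspace(1e-3, 80.0, 8000)
dlam = lam[1] - lam[0]
pi_lam = gamma(a=2.0, scale=5.0).pdf(lam)

def p_k(k):
    """p(k) = integral of pi(lambda) * Poisson(k | lambda) d lambda, as in Eq. (2)."""
    return float(np.sum(pi_lam * poisson.pmf(k, lam)) * dlam)

pk = np.array([p_k(k) for k in range(120)])
print(pk.sum())   # close to 1: a proper mixed degree distribution
print(pk[:5])     # first few probabilities of the mixture
```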
More generally, there exists an interplay between information and disequilibrium in a system, whose product can be taken as a statistical measure of complexity [17].
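A minimal sketch of such a product measure, in the spirit of [17] (the normalization choices are mine): the measure vanishes both at equilibrium (uniform distribution) and at perfect order, and is positive in between.

```python
import numpy as np

def lmc_complexity(p):
    """Product C = H * D: normalized Shannon information H times
    disequilibrium D (squared distance from the uniform distribution)."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    n = p.size
    nz = p[p > 0]
    H = float(-(nz * np.log(nz)).sum() / np.log(n))   # in [0, 1]
    D = float(((p - 1.0 / n) ** 2).sum())
    return H * D

print(lmc_complexity([0.25, 0.25, 0.25, 0.25]))   # equilibrium: D = 0, so C = 0
print(lmc_complexity([1.0, 0.0, 0.0, 0.0]))       # perfect order: H = 0, so C = 0
print(lmc_complexity([0.7, 0.1, 0.1, 0.1]))       # intermediate case: C > 0
```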
When we consider the space
The context built through
In
As said earlier with tensor networks, nodes are particularly interesting when their feature contents are considered, say a certain function
It is important to note that tensor networks evoke a rich architecture of interconnected nodes whose glue is entanglement, which tells about the underlying information [25]. Many topological measures provide information on network structures, including entropic ones. Mutual information can, for instance, be computed between two network modules, say $A$ and $B$.
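One heuristic way to make this concrete, and not necessarily the construction intended in [25] or [10]: assign each module a density matrix built from its Laplacian and combine von Neumann entropies. The barbell example below is an assumption of mine.

```python
import numpy as np
import networkx as nx

def graph_entropy(G):
    """Von Neumann-style entropy of the density matrix rho = L / Tr(L),
    where L is the graph Laplacian (one common heuristic construction)."""
    L = nx.laplacian_matrix(G).toarray().astype(float)
    rho = L / np.trace(L)
    ev = np.linalg.eigvalsh(rho)
    ev = ev[ev > 1e-12]
    return float(-(ev * np.log(ev)).sum())

G = nx.barbell_graph(6, 2)                      # two dense modules joined by a short path
A = G.subgraph(range(0, 6))                     # module A
B = G.subgraph(range(8, 14))                    # module B
AB = G.subgraph(list(range(0, 6)) + list(range(8, 14)))
mutual_info = graph_entropy(A) + graph_entropy(B) - graph_entropy(AB)
print(mutual_info)
```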
In general, quantum entanglement occurs when the quantum state of interacting particles cannot be described independently but only as a whole system. When such a system interacts with the environment, a loss of information occurs, something generally called decoherence. In isomorphic graphs, defined by having an adjacency matrix unique up to permutations of rows and columns, the same entanglement entropy is reflected in equivalent network states. The presence of synchronization [23] (say, nodes pulsing at the same frequency and thus representing a synchronized state) somehow ensures the existence of entanglement between nodes, and protects the network from decoherence effects.
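A small check of the permutation statement, assuming networkx and an arbitrary relabelling: isomorphic graphs share the same Laplacian spectrum, hence any spectral (entanglement-like) entropy computed from it coincides.

```python
import numpy as np
import networkx as nx

G1 = nx.path_graph(5)
G2 = nx.relabel_nodes(G1, {0: 3, 1: 0, 2: 4, 3: 1, 4: 2})   # an arbitrary relabelling

print(nx.is_isomorphic(G1, G2))   # True: same adjacency up to a permutation of rows/columns

s1 = np.sort(np.linalg.eigvalsh(nx.laplacian_matrix(G1).toarray().astype(float)))
s2 = np.sort(np.linalg.eigvalsh(nx.laplacian_matrix(G2).toarray().astype(float)))
print(np.allclose(s1, s2))        # identical spectra, hence identical spectral entropies
```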
Network hubs are good candidate nodes for analyzing node and edge dynamics; they were, for instance, the first objects of network control. Yet it turns out they are not categorized as drivers: by their nature, the relatively large interconnected network regions receive similar signals through them, leaving unexplored many other regions that are possible targets. This is an effect of the symmetries induced by the hubs, which reduce the number of nodes to be controlled but at the same time expand the number of edges through which the control signals flow [27].
In general, symmetries in a network induce, through some transformations, the invariance of its elements; this holds for transformations that leave the network's properties unchanged. A network invariant is any property that is preserved under any of its possible isomorphisms, thus independently of its representation. A symmetry group collects such structure-preserving transformations, namely the automorphisms of the network.
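As a concrete illustration, assuming networkx (the star graph echoes the hub-induced symmetries discussed above), the automorphisms of a small hub-and-leaves network can be enumerated directly:

```python
import networkx as nx
from networkx.algorithms.isomorphism import GraphMatcher

G = nx.star_graph(3)     # a hub (node 0) with three structurally interchangeable leaves
# Automorphisms: relabellings of the nodes that leave the adjacency structure unchanged.
automorphisms = [dict(m) for m in GraphMatcher(G, G).isomorphisms_iter()]
print(len(automorphisms))         # 6 = 3!, the permutations of the three leaves
print(automorphisms[:2])
```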
[1] Dealing with data (special issue), Science, 331 (2011), 639-806.
[2] R. B. Altman and E. A. Ashley, Using "Big Data" to dissect clinical heterogeneity, Circulation, 131 (2015), 232-233.
[3] E. J. Candes, J. Romberg and T. Tao, Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information, IEEE T. Inform. Theory, 52 (2006), 489-509.
[4] E. Capobianco, Aliasing in gene feature detection by projective methods, J Bioinform Comput Biol, 7 (2009), 685-700.
[5] N. V. Chawla, Data mining for imbalanced datasets: An overview, in Data Mining and Knowledge Discovery Handbook, Springer, (2005), 853-867.
[6] L. Demetrius and T. Manke, Robustness and network evolution: An entropic principle, Phys A, 346 (2005), 682-696.
[7] D. L. Donoho, Compressed sensing, IEEE T. Inform. Theory, 52 (2006), 1289-1306.
[8] Y. C. Eldar and G. Kutyniok, Compressed Sensing: Theory and Applications, Cambridge University Press, 2012.
[9] J. Fan, F. Han and H. Liu, Challenges of big data analysis, Nat Sci Rev, 1 (2014), 293-314.
[10] S. Garnerone, P. Giorda and P. Zanardi, Bipartite quantum states and random complex networks, New J Phys, 14 (2012), 013011.
[11] R. Gens and P. Domingos, Deep symmetry networks, Advances in Neural Information Processing Systems, 2014.
[12] U. Grenander, Probability and Statistics: The Harald Cramér Volume, Wiley, 1959.
[13] S. Havlin, E. Lopez, S. Buldyrev and H. E. Stanley, Anomalous conductance and diffusion in complex networks, Diff Fundam, 2 (2005), 1-11.
[14] K. M. Lee, B. Min and K. Goh, Towards real-world complexity: An introduction to multiplex networks, Eur. Phys. J. B, 88 (2015), p48.
[15] J. Leskovec, K. J. Lang, A. Dasgupta and M. W. Mahoney, Statistical properties of community structure in large social and information networks, Proc. 17th Int. Conf. on World Wide Web (WWW), (2008), 695-704.
[16] B. G. Lindsay, Mixture models: Theory, geometry and applications, NSF-CBMS Regional Conf. Ser. Prob. Stat, 5 (1995).
[17] R. Lopez-Ruiz, H. L. Mancini and X. Calbet, A statistical measure of complexity, in Concepts and Recent Advances in Generalized Information Measures and Statistics, (2013), 147-168.
[18] C. Lynch, Big Data: How do your data grow?, Nature, 455 (2008), 28-29.
[19] E. Marras, A. Travaglione and E. Capobianco, Sub-modular resolution analysis by network mixture models, Stat Appl Genet Mol Biol, 9 (2010), Art. 19, 43pp.
[20] A. Montanari, Computational implications of reducing data to sufficient statistics, Electron. J. Statist., 9 (2015), 2370-2390.
[21] M. E. J. Newman, Modularity and community structure in networks, PNAS, 103 (2006), 8577-8582.
[22] M. E. J. Newman and E. A. Leicht, Mixture models and exploratory analysis in networks, PNAS, 104 (2007), 9564-9569.
[23] V. Nicosia, M. Valencia, M. Chavez, A. Diaz-Guilera and V. Latora, Remote synchronization reveals network symmetries and functional modules, Phys Rev Lett, 110 (2013), 174102.
[24] B. Olshausen, Sparse codes and spikes, in Probabilistic Models of the Brain: Perception and Neural Function (eds. R. P. N. Rao, B. A. Olshausen and M. S. Lewicki), MIT Press, 2002.
[25] R. Orus, A practical introduction to tensor networks: Matrix product states and projected entangled pair states, Ann Phys, 349 (2014), 117-158.
[26] J. J. Ramasco and M. Mungan, Inversion method for content-based networks, Phys Rev E, 77 (2008), 036122, 12pp.
[27] J. J. Slotine and Y. Y. Liu, Complex networks: The missing link, Nat Phys, 8 (2012), 512-513.
[28] J. W. Vaupel and A. I. Yashin, Heterogeneity's ruses: Some surprising effects of selection on population dynamics, Amer Statist, 39 (1985), 176-185.