Bacteria classification using minimal absent words

Gabriele Fici; Alessio Langiu; Giosuè Lo Bosco; Riccardo Rizzo; Gabriele Fici; Alessio Langiu; Giosuè Lo Bosco; Riccardo Rizzo

doi:10.3934/medsci.2018.1.23

AIMS Medical Science

2018, Volume 5, Issue 1: 23-32. doi: 10.3934/medsci.2018.1.23

Previous Article Next Article

Research article Special Issues

Bacteria classification using minimal absent words

1.
Dipartimento di Matematica e Informatica, Università di Palermo, Palermo, Italy
2.
Department of Informatics, King’s College London, London, UK
3.
Istituto di Calcolo e Reti ad Alte Prestazioni, Consiglio Nazionale delle Ricerche, Palermo, Italy.

Received: 12 May 2017 Accepted: 24 November 2017 Published: 23 December 2017

Bacteria classification has been deeply investigated with different tools for many purposes,such as early diagnosis, metagenomics, phylogenetics. Classification methods based on ribosomalDNA sequences are considered a reference in this area. We present a new classificatier for bacteriaspecies based on a dissimilarity measure of purely combinatorial nature. This measure is based onthe notion of Minimal Absent Words, a combinatorial definition that recently found applications inbioinformatics. We can therefore incorporate this measure into a probabilistic neural network in orderto classify bacteria species. Our approach is motivated by the fact that there is a vast literature on thecombinatorics of Minimal Absent Words in relation with the degree of repetitiveness of a sequence.We ran our experiments on a public dataset of Ribosomal RNA Sequences from the complex 16S. Ourapproach showed a very high score in the accuracy of the classification, proving hence that our methodis comparable with the standard tools available for the automatic classification of bacteria species.
- bacteria classification,
- supervised classification,
- probabilistic neural network,
- minimal absent word,
- combinatorics on words
Citation: Gabriele Fici, Alessio Langiu, Giosuè Lo Bosco, Riccardo Rizzo. Bacteria classification using minimal absent words[J]. AIMS Medical Science, 2018, 5(1): 23-32. doi: 10.3934/medsci.2018.1.23

Related Papers:

Abstract

Bacteria classification has been deeply investigated with different tools for many purposes,such as early diagnosis, metagenomics, phylogenetics. Classification methods based on ribosomalDNA sequences are considered a reference in this area. We present a new classificatier for bacteriaspecies based on a dissimilarity measure of purely combinatorial nature. This measure is based onthe notion of Minimal Absent Words, a combinatorial definition that recently found applications inbioinformatics. We can therefore incorporate this measure into a probabilistic neural network in orderto classify bacteria species. Our approach is motivated by the fact that there is a vast literature on thecombinatorics of Minimal Absent Words in relation with the degree of repetitiveness of a sequence.We ran our experiments on a public dataset of Ribosomal RNA Sequences from the complex 16S. Ourapproach showed a very high score in the accuracy of the classification, proving hence that our methodis comparable with the standard tools available for the automatic classification of bacteria species.

References

[1]	Nelson KE, Weinstock GM (2010) A catalog of reference genomes from the human microbiome. Science 328: 994–999. doi: 10.1126/science.1183605
[2]	Soueidan H, NikolskiM(2016) Machine learning for metagenomics: methods and tools. Metagenomics 1: 1–19.
[3]	La Rosa M, Fiannaca A (2015) Probabilistic topic modeling for the analysis and classification of genomic sequences. BMC Bioinformatics 16.
[4]	Lo Bosco G, Pinello L (2015) A new feature selection methodology for K-mers representation of DNA sequences. In CIBB 2015 : 8623 of LNCS, 99–108.
[5]	Pinello L, Lo Bosco G, Hanlon B, et al. (2011) A motif-independent metric for DNA sequence specificity. BMC Bioinformatics 12.
[6]	Ferragina P, Giancarlo R, Greco V, et al. (2007) Compression-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment. BMC Bioinformatics 8: 252.
[7]	Fiannaca A, La Rosa M, Rizzo R, et al. (2015) A k-mer-based barcode DNA classification methodology based on spectral representation and a neural gas network. Artificial Intelligence Med 64: 173–184. doi: 10.1016/j.artmed.2015.06.002
[8]	Specht DF (1990) Probabilistic neural networks. Neural Networks 3: 109–118. doi: 10.1016/0893-6080(90)90049-Q
[9]	Chairungsee S, Crochemore M (2012) Using minimal absent words to build phylogeny. Theoretical Computer Sci 450: 109–116. doi: 10.1016/j.tcs.2012.04.031
[10]	Cole JR, Wang Q, Cardenas E, et al. (2009) The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res 37: D141–D145. doi: 10.1093/nar/gkn879
[11]	Fici G (2006) Minimal Forbidden Words and Applications. PhD thesis, Université de Marne-la-Vallée.
[12]	Crochemore M, Mignosi F, Restivo A, et al., (1999) Text compression using antidictionaries. In ICALP 1999 Volume: 1644 of LNCS, 261–270.
[13]	Barton C, Héliou A, Mouchard L, et al. (2014) Linear-time Computation of Minimal Absent Words Using Suffx Array. BMC Bioinformatics 15: 388.
[14]	Crochemore M, Fici G, Mercas¸ R, et al. (2016) Linear-Time Sequence Comparison Using Minimal Absent Words & Applications. In LATIN 2016 Volume: 9644 of LNCS, 334–346.
[15]	Pinho AJ, Ferreira PJSG, Garcia SP, et al. (2009) On finding minimal absent words. BMC Bioinformatics 10: 137.
[16]	Barton C, Héliou A, Mouchard L, et al. (2015) Parallelising the computation of minimal absent words. In PPAM 2015, Volume: 9574 of LNCS, 243–253. Springer.
[17]	Fukae H, Ota T, Morita H (2012) On fast and memory-effcient construction of an antidictionary array. In: ISIT 2012, 1092–1096. IEEE.
[18]	Crochemore M, Mignosi F, Restivo A (1998) Automata and forbidden words. Information Processing Letters 67: 111–117. doi: 10.1016/S0020-0190(98)00104-5
[19]	Online content: MAW: a suite on the computation and application of Minimal Absent Words. Available from: https://github.com/solonas13/maw.
[20]	Wang Q, Garrity GM, Tiedje JM, et al. (2007) Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy, Applied Environmental Microbiology 73: 5261–5267.
[21]	Mao KZ, Tan KC, Ser W (2000) Probabilistic neural-network structure determination for pattern classification IEEE Transactions on neural networks. 11: 1009–1016.

Reader Comments

Your name:*

Email:*
© 2018 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)