Space Efficient Data Structures for N-gram Retrieval

Fotios Kounelis; Christos Makris; Fotios Kounelis; Christos Makris

doi:10.3934/medsci.2017.4.426

AIMS Medical Science

2017, Volume 4, Issue 4: 426-440. doi: 10.3934/medsci.2017.4.426

Previous Article Next Article

Research article Special Issues

Space Efficient Data Structures for N-gram Retrieval

Fotios Kounelis ,
Christos Makris ^,

Department of Computer Engineering and Informatics, University of Patras, 26500 Greece

Received: 29 April 2017 Accepted: 18 October 2017 Published: 24 October 2017

A significant problem in computer science is the management of large data strings and a great number of works dealing with the specific problem has been published in the scientific literature. In this article, we use a technique to store efficiently biological sequences, making use of data structures like suffix trees and inverted files and also employing techniques like n-grams, in order to improve previous constructions. In our attempt, we drastically reduce the space needed to store the inverted indexes, by representing the substrings that appear more frequently in a more compact inverted index. Our technique is based on n-gram indexing, providing us the extra advantage of indexing sequences that cannot be separated in words. Moreover, our technique combines classical one level with two-level n-gram inverted file indexing. Our results suggest that the new proposed algorithm can compress the data more efficiently than previous attempts.
- suffix trees,
- n-gram indexing,
- data compression,
- biological sequences,
- inverted files,
- weighted sequences
Citation: Fotios Kounelis, Christos Makris. Space Efficient Data Structures for N-gram Retrieval[J]. AIMS Medical Science, 2017, 4(4): 426-440. doi: 10.3934/medsci.2017.4.426

Related Papers:

Abstract

A significant problem in computer science is the management of large data strings and a great number of works dealing with the specific problem has been published in the scientific literature. In this article, we use a technique to store efficiently biological sequences, making use of data structures like suffix trees and inverted files and also employing techniques like n-grams, in order to improve previous constructions. In our attempt, we drastically reduce the space needed to store the inverted indexes, by representing the substrings that appear more frequently in a more compact inverted index. Our technique is based on n-gram indexing, providing us the extra advantage of indexing sequences that cannot be separated in words. Moreover, our technique combines classical one level with two-level n-gram inverted file indexing. Our results suggest that the new proposed algorithm can compress the data more efficiently than previous attempts.

References

[1]	Blandford D, Blelloch G (2002) Index compression through document reordering. In: Storer JA and Cohn M, Eds., Proc. 2002 IEEE Data Compression Conference, April, IEEE Computer Society Press, Los Alamitos, CA. pp. 342-351.
[2]	Scholer F, Williams HE, Yiannis J, et al. (2002) Compression of inverted indexes for fast query evaluation. In: Beaulieu M, Baeza-Yates R, Myaeng SH and Jarvelin K, Eds., Proc. 25th Annual International ACMSIGIR Conference on Research and Development in Information Retrieval, August, Tampere, Finland, ACM Press, New York, pp. 222-229.
[3]	Brandon MC, Wallace DC, Baldi P (2009) Data structures and compression algorithms for genomic sequence data. Bioinformatics 25: 1731-1738. doi: 10.1093/bioinformatics/btp319
[4]	Gonzalo N, Veli M (2007) Compressed full-text indexes. ACM Computing Surveys 39: 2. doi: 10.1145/1216370.1216372
[5]	Ricardo AB, Berthier AR (2011) Modern Information Retrieval-the concepts and technology behind search, Second edition. Pearson Education Ltd., Harlow, England ISBN 978-0-321-41691-9.
[6]	Ian HW, Alistair M, Timothy CB (1999) Managing Gigabytes: Compressing and Indexing Documents and Images, Second Edition. Morgan Kaufmann ISBN 1-55860-570-3.
[7]	Weiner P (1973) Linear Pattern Matching Algorithms In 14th Annual IEEE Symposium on Switching and Automata Theory.
[8]	Ukkonen E (1995) On-Line Construction of Suffix Trees. Algorithmica: 249-260.
[9]	Iliopoulos C, Makris C, Panagis Y, et al. (2006) The Weighted Suffix Tree: An Efficient Data Structure for Handling Molecular Weighted Sequences and its Applications. Fundament Informa 71: 259-277.
[10]	Diamanti K, Kanavos A, Makris C, et al. (2014) Handling Weighted Sequences Employing Inverted Files and Suffix Trees. In Web Information Systems and Technologies, pp. 231-238, extended version, private communication with the authors.
[11]	Kim MS, Whang KY, Lee JG, et al. (2005) N-gram/2l: A space and time efficient two level n-gram inverted index structure. Int Conference Very Large Databases: 325-336.
[12]	Christodoulakis M, Iliopoulos CS, Mouchard L, et al. (2006) Computation of repetitions and regularities of biologically weighted sequences. J Comput Biol 13: 1214-1231. doi: 10.1089/cmb.2006.13.1214
[13]	Zhu H, Kollios G, Athitsos V (2012) A Generic Framework for Efficient and Effective Subsequence Retrieval. Proceed Very Large Data Bases: 1579-1590.
[14]	Amir A, Iliopoulos CS, Kapah O, et al. (2006). Approximate matching in weighted sequences. Combinator Pattern Match: 365376
[15]	Alatabbi A, Crochemore M, Iliopoulos CS, et al. (2012) Overlapping repetitions in weighted sequence. Int Informat Technol Conference: 435-440.
[16]	Zhang H, Guo Q, Iliopoulos CS (2010) An algorithmic framework for motif discovery problems in weighted sequences. Int Conference Algorithms Complex: 335-346.
[17]	Zhang H, Guo Q, Iliopoulos CS (2010) Varieties of regularities in weighted sequences. Algorithmic Aspect Informa Manage: 271-280.
[18]	Li G, Deng D, Feng J (2011) Faerie: Efficient Filtering Algorithm for Approximate Dictionary-based Entity Extraction. Special Interest Group Manage Data: 529-540.
[19]	Mouza C, Litwin W, Rigaux P, et al. (2009) AS-Index: A Structure for String Search Using n-grams and Algebraic Signatures. Confer Informat Knowledge: 295-304.
[20]	Marsan L, Sagot MF (2000) Extracting structured motifs using a suffix tree-algorithms and application to promoter consensus identification. Int Conference Res Comput Mol Biol: 210-219.
[21]	Sun Z, Yang J, Deogun JS (2004) Misae: A new approach for regulatory motif extraction. Computa System Bioinforma Conference: 173-181.
[22]	Holub J, Smyth WF (2003) Algorithms on indeterminate strings. In Australasian Workshop on Combinatorial Algorithms.
[23]	Holub J, Smyth WF, Wang S (2008) Fast pattern matching on indeterminate strings. J Discrete Algorithms 6: 37-50. doi: 10.1016/j.jda.2006.10.003
[24]	Manning CD, Raghavan P, Schutze H (2008) Introduction to Information Retrieval. Cambridge University Press.
[25]	Kim MS, Whang KY, Lee JG (2007) N-gram/2l-approximation: a two-level n-gram inverted index structure for approximate string matching. Compu System Sci Engineer 22: 6.

Reader Comments

Your name:*

Email:*
© 2017 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)