×

The longest common substring problem. (English) Zbl 1364.68379

Summary: Given a set \(\mathcal{D}\) of \(q\) documents, the Longest Common Substring (LCS) problem asks, for any integer \(2 \leqslant k \leqslant q\), the longest substring that appears in \(k\) documents. LCS is a well-studied problem having a wide range of applications in Bioinformatics: from microarrays to DNA sequences alignments and analysis. This problem has been solved by L. C. K. Hui [Comput. Syst. Sci. Eng. 15, No. 2, 73–76 (2000; Zbl 0954.68065)] by using a famous constant-time solution to the Lowest Common Ancestor (LCA) problem in trees coupled with the use of suffix trees.
In this article, we present a simple method for solving the LCS problem by using suffix trees (STs) and classical union-find data structures. In turn, we show how this simple algorithm can be adapted in order to work with other space efficient data structures such as the enhanced suffix arrays (ESA) and the compressed suffix tree.

MSC:

68W32 Algorithms on strings

Citations:

Zbl 0954.68065
Full Text: DOI

References:

[1] AbeliukA., CánovasR. and NavarroG. (2013). Practical compressed suffix trees. Algorithms6(2)319-351.10.3390/a6020319 · Zbl 1461.68060
[2] AbouelhodaM. I., KurtzS. and OhlebuschE. (2004). Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms2(1)53-86.10.1016/S1570-8667(03)00065-0 · Zbl 1115.92303
[3] AhoA. V., HopcroftJ. E. and UllmanJ. D. (1976). On finding lowest common ancestors in trees. SIAM Journal on Computing5(1)115-132.10.1137/0205011 · Zbl 0325.68018
[4] AlstrupS., GørtzI. L., RauheT., ThorupM. and ZwickU. (2005). Union-find with constant time deletions. In: CairesL., ItalianoG. F., MonteiroL., PalamidessiC. and YungM. (eds.) ICALP, Springer Lecture Notes in Computer Science358078-89.10.1007/11523468_72184622 · Zbl 1084.68026
[5] Ben-AmramA. M. and YoffeS. (2011). A simple and efficient union-find-delete algorithm. Theoretical Computer Science412(4-5)487-492.10.1016/j.tcs.2010.11.005 · Zbl 1209.68150
[6] BenderM. A. and Farach-ColtonM. (2000). The LCA problem revisited. In: GonnetG. H., PanarioD. and ViolaA. (eds.) Proceedings of the Theoretical Informatics, 4th Latin American Symposium, Punta del Este, Uruguay, April 10-14, 2000. Springer Lecture Notes in Computer Science177688-94. · Zbl 0959.68133
[7] BerkmanO. and VishkinU. (1993). Recursive star-tree parallel data structure. SIAM Journal on Computing22(2)221-242.10.1137/0222017 · Zbl 0770.68044
[8] BreslauerD. and ItalianoG. F. (2013). Near real-time suffix tree construction via the fringe marked ancestor problem. Journal of Discrete Algorithms1832-48.10.1016/j.jda.2012.07.003 · Zbl 1267.68323
[9] CrochemoreM., HancartC. and LecroqT. (2007). Algorithms on Strings. Cambridge University Press, 392. · Zbl 1137.68060
[10] DillencourtM. B., SametH. and TamminenM. (1992). A general approach to connected-component labelling for arbitrary image representations. Journal of ACM39(2)253-280.10.1145/128749.128750 · Zbl 0799.68197
[11] Farach-ColtonM., FerraginaP. and MuthukrishnanS. (November 2000). On the sorting-complexity of suffix tree construction. Journal of ACM47(6)987-1011.10.1145/355541.355547 · Zbl 1094.68694
[12] FiorioC. and GustedtJ. (1996). Two linear time Union-Find strategies for image processing. Theoretical Computer Science154(2)165-181.10.1016/0304-3975(94)00262-2 · Zbl 0873.68212
[13] FischerJ. and HeunV. (2006). Theoretical and practical improvements on the RMQ-problem, with applications to LCA and LCE. In: LewensteinM. and ValienteG. (eds.) CPM, Springer Lecture Notes in Computer Science400936-48.10.1007/11780441_52272946 · Zbl 1196.68068
[14] FischerJ. and HeunV. (2007). A new succinct representation of RMQ-information and improvements in the enhanced suffix array. In: ChenB., PatersonM. and ZhangG. (eds.) ESCAPE, Springer Lecture Notes in Computer Science4614459-470.10.1007/978-3-540-74450-4_41 · Zbl 1176.68058
[15] FischerJ. and HeunV. (2008). Range median of minima queries, super-cartesian trees, and text indexing. In: MillerM. and WadaK. (eds.) IWOCA, College Publications239-252. · Zbl 1233.68128
[16] FischerJ., MäkinenV. and NavarroG. (2009). Faster entropy-bounded compressed suffix trees. Theoretical Computer Science410(51)5354-5364.10.1016/j.tcs.2009.09.012 · Zbl 1187.68171
[17] GabowH. N. and TarjanR. E. (1985). A linear-time algorithm for a special case of disjoint set union. Journal of Computer and System Sciences30(2)209-221.10.1016/0022-0000(85)90014-5 · Zbl 0572.68058
[18] GagieT., KärkkäinenJ., NavarroG. and PuglisiS. J. (2013). Colored range queries and document retrieval. Theoretical Computer Science48336-50.10.1016/j.tcs.2012.08.004 · Zbl 1292.68045
[19] GalilZ. and ItalianoG. F. (1991). Data structures and algorithms for disjoint set union problems. ACM Computing Surveys23(3)319-344.10.1145/116873.116878
[20] GogS. and OhlebuschE. (2013). Compressed suffix trees: Efficient computation and storage of LCP-values. ACM Journal of Experimental Algorithmics182.1:1-2.1:31. · Zbl 1322.68253
[21] GonnetG. H., PanarioD. and ViolaA. (eds.) (2000). In: Proceedings of the LATIN Theoretical Informatics, 4th Latin American Symposium, Punta del Este, Uruguay, April 10-14, 2000. Springer Lecture Notes in Computer Science 1776. · Zbl 0935.00049
[22] GrossiR. and VitterJ. S. (2005). Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing35(2)378-407.10.1137/S0097539702402354 · Zbl 1092.68115
[23] GusfieldD. (1997). Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology, Cambridge University Press.1460730 · Zbl 0934.68103
[24] GustedtJ. (1998). Efficient Union-Find for planar graphs and other sparse graph classes. Theoretical Computer Science203(1)123-141.10.1016/S0304-3975(97)00291-0 · Zbl 0913.68084 · doi:10.1016/S0304-3975(97)00291-0
[25] HarelD. and TarjanR. E. (1984). Fast algorithms for finding nearest common ancestors. SIAM Journal on Computing13(2)338-355.10.1137/0213024 · Zbl 0535.68022
[26] HuiL. C. K. (1992). Color set size problem with application to string matching. In: ApostolicoA., CrochemoreM., GalilZ. and ManberU. (eds.) CPM, Springer Lecture Notes in Computer Science644230-243.10.1007/3-540-56024-6_19
[27] HuiL. C. K. (2000). A practical algorithm to find longest common substring in linear time. International Journal of Computer Science and Engineering1573-76. · Zbl 0954.68065
[28] KärkkäinenJ., SandersP. and BurkhardtS. (2006). Simple linear work suffix array construction. Journal of ACM53(6)918-936.10.1145/1217856.1217858 · Zbl 1326.68111
[29] KimD. K., KimM. and ParkH. (2008). Linearized suffix tree: An efficient index data structure with the capabilities of suffix trees and suffix arrays. Algorithmica52(3)350-377.10.1007/s00453-007-9061-22452920 · Zbl 1163.68013
[30] KimD. K. and ParkH. (2005). A new compressed suffix tree supporting fast search and its construction algorithm using optimal working space. In: ApostolicoA., CrochemoreM. and ParkK. (eds.) CPM. Springer Lecture Notes in Computer Science353733-44.10.1007/11496656_4 · Zbl 1131.68428
[31] KimD. K., SimJ. S., ParkH. and ParkK. (2005). Constructing suffix arrays in linear time. Journal of Discrete Algorithms3(2-4)126-142.10.1016/j.jda.2004.08.019 · Zbl 1101.68505
[32] KoP. and AluruS. (2005). Space efficient linear time construction of suffix arrays. Journal of Discrete Algorithms3(2-4)143-156.10.1016/j.jda.2004.08.002 · Zbl 1101.68506
[33] LinJ., JiangY. and AdjerohD. A. (2009). The virtual suffix tree. International Journal of Foundations of Computer Science20(6)1109-1133.10.1142/S0129054109007066 · Zbl 1186.68136
[34] LoeblM. and NesetrilJ. (1988a). Linearity and unprovability of set union problem strategies. In: SimonJ. (eds.) STOC, ACM360-366. · Zbl 0874.68135
[35] LoeblM. and NesetrilJ. (1988b). Postorder hierarchy for path compressions and set union. In: DassowJ. and KelemenJ. (eds.) IMYCS, Springer Lecture Notes in Computer Science381146-151.
[36] LucasJ. M. (1990). Postorder disjoint set union is linear. SIAM Journal on Computing19(5)868-882.10.1137/0219060 · Zbl 0711.68062 · doi:10.1137/0219060
[37] ManberU. and MyersE. W. (1993). Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing22(5)935-948.10.1137/0222058 · Zbl 0784.68027
[38] McCreightE. M. (1976). A space-economical suffix tree construction algorithm. Journal of ACM23(2)262-272.10.1145/321941.321946 · Zbl 0329.68042 · doi:10.1145/321941.321946
[39] NavarroG. (March 2014) Spaces, trees, and colors: The algorithmic landscape of document retrieval on sequences. ACM Computing Surveys46(4)52:1-52:47. · Zbl 1305.68078
[40] NavarroG. and MäkinenV. (2007). Compressed full-text indexes. ACM Computing Surveys39(1). · Zbl 1321.68263
[41] OhlebuschE. (2013). Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction, Oldenbusch Verlag. · Zbl 1295.92011
[42] OhlebuschE. and GogS. (2009). A compressed enhanced suffix array supporting fast string matching. In: KarlgrenJ., TarhioJ. and HyyröH. (eds.) SPIRE, Springer Lecture Notes in Computer Science572151-62.10.1007/978-3-642-03784-9_62558196
[43] PuglisiS. J., SmythW. F. and TurpinA. H. (2007). A taxonomy of suffix array construction algorithms. ACM Computing Surveys39(2)1-31.10.1145/1216370.1216371
[44] RussoL. M. S., NavarroG. and OliveiraA. L. (2011) Fully compressed suffix trees.ACM Transactions on Algorithms7(4)53. · Zbl 1295.68103
[45] SadakaneK. (December 2007). Compressed suffix trees with full functionality. Theor. Comp. Sys.41(4)589-607.10.1007/s00224-006-1198-x · Zbl 1148.68015
[46] SchieberB. and VishkinU. (1988). On finding lowest common ancestors: Simplification and parallelization. SIAM Journal on Computing17(6)1253-1262.10.1137/0217079 · Zbl 0669.68049
[47] SzpankowskiW. (1993). A generalized suffix tree and its (un)expected asymptotic behaviors. SIAM Journal of Computing221176-1198.10.1137/0222070 · Zbl 0799.68050 · doi:10.1137/0222070
[48] SzpankowskiW. (2001). Average Case Analysis of Algorithms on Sequences, Wiley Series in Discrete Mathematics and Optimization, Wiley-Interscience. · Zbl 0969.00028
[49] UkkonenE. (1995). On-line construction of suffix trees. Algorithmica14(3)249-260.10.1007/BF012063311343552 · Zbl 0831.68027 · doi:10.1007/BF01206331
[50] WeinerP. (1973). Linear pattern matching algorithms. In: SWAT (FOCS), IEEE Computer Society, 1-11.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.