×

Random access in persistent strings and segment selection. (English) Zbl 07729114

Summary: We consider compact representations of collections of similar strings that support random access queries. The collection of strings is given by a rooted tree where edges are labeled by an edit operation (inserting, deleting, or replacing a character) and a node represents the string obtained by applying the sequence of edit operations on the path from the root to the node. The goal is to compactly represent the entire collection while supporting fast random access to any part of a string in the collection. This problem captures natural scenarios such as representing the past history of an edited document or representing highly-repetitive collections. Given a tree with \(n\) nodes, we show how to represent the corresponding collection in \(O(n)\) space and \(O(\log n/ \log \log n)\) query time. This improves the previous time-space trade-offs for the problem. Additionally, we show a lower bound proving that the query time is optimal for any solution using near-linear space. To achieve our bounds for random access in persistent strings we show how to reduce the problem to the following natural geometric selection problem on line segments. Consider a set of horizontal line segments in the plane. Given parameters \(i\) and \(j\), a segment selection query returns the \(j\)th smallest segment (the segment with the \(j\)th smallest \(y\)-coordinate) among the segments crossing the vertical line through \(x\)-coordinate \(i\). The segment selection problem is to preprocess a set of horizontal line segments into a compact data structure that supports fast segment selection queries. We present a solution that uses \(O(n)\) space and support segment selection queries in \(O(\log n/ \log \log n)\) time, where \(n\) is the number of segments. Furthermore, we prove that that this query time is also optimal for any solution using near-linear space.

MSC:

68Pxx Theory of data
68Qxx Theory of computing
68Wxx Algorithms in computer science

References:

[1] Agarwal, PK; Arge, L.; Kaplan, H.; Molad, E.; Tarjan, RE; Yi, K., An optimal dynamic data structure for stabbing-semigroup queries, SIAM J. Comput., 41, 1, 104-127 (2012) · Zbl 1243.68157 · doi:10.1137/10078791X
[2] Barbay, J.; Claude, F.; Gagie, T.; Navarro, G.; Nekrich, Y., Efficient fully-compressed sequence representations, Algorithmica, 69, 1, 232-268 (2014) · Zbl 1307.68029 · doi:10.1007/s00453-012-9726-3
[3] Barbay, J., He, M., Munro, J.I., Rao, S.S.: Succinct indexes for strings, binary relations and multi-labeled trees. In: Proc. 18Th SODA, pp. 680-689 (2007) · Zbl 1302.68097
[4] Belazzougui, D., Cording, P.H., Puglisi, S.J., Tabei, Y.: Access, rank, and select in grammar-compressed strings. In: Proc. 23Rd ESA, pp. 142-154 (2015) · Zbl 1465.68119
[5] Belazzougui, D.; Navarro, G., Optimal lower and upper bounds for representing sequences, ACM Trans Algorithms, 11, 4, 1-21 (2015) · Zbl 1398.68103 · doi:10.1145/2629339
[6] Bille, P.; Christiansen, AR; Cording, PH; Gørtz, IL; Skjoldjensen, FR; Vildhøj, HW; Vind, S., Dynamic relative compression, dynamic partial sums, and substring concatenation, Algorithmica, 80, 11, 3207-3224 (2018) · Zbl 1410.68415 · doi:10.1007/s00453-017-0380-7
[7] Bille, P., Christiansen, A.R., Prezza, N., Skjoldjensen, F.R.: Succinct partial sums and fenwick trees. In: Proc. 24Th SPIRE, pp. 91-96 (2017) · Zbl 1454.68027
[8] Bille, P.; Ettienne, MB; Gørtz, IL; Vildhøj, HW, Time-space trade-offs for lempel-ziv compressed indexing, Theoret. Comput. Sci., 713, 66-77 (2018) · Zbl 1386.68057 · doi:10.1016/j.tcs.2017.12.021
[9] Bille, P., Gørtz, I.L.: Random access in persistent strings. In: Proc. 31St ISAAC (2020) · Zbl 07729114
[10] Bille, P.; Gørtz, IL; Landau, GM; Weimann, O., Tree compression with top trees, Inform. Comput., 243, 166-177 (2015) · Zbl 1327.68085 · doi:10.1016/j.ic.2014.12.012
[11] Bille, P.; Landau, GM; Raman, R.; Sadakane, K.; Satti, SR; Weimann, O., Random access to grammar-compressed strings and trees, SIAM J. Comput., 44, 3, 513-539 (2015) · Zbl 1329.68084 · doi:10.1137/130936889
[12] Chan, TM, Persistent predecessor search and orthogonal point location on the word RAM, ACM Trans. Algorithms, 9, 3, 1-22 (2013) · Zbl 1301.68236 · doi:10.1145/2483699.2483702
[13] Chan, TM; Pǎtraşcu, M., Transdichotomous results in computational geometry, i: Point location in sublogarithmic time, SIAM J. Comput., 39, 2, 703-729 (2009) · Zbl 1200.68122 · doi:10.1137/07068669X
[14] Chan, T.M., Tsakalidis, K.: Dynamic planar orthogonal point location in sublogarithmic time. In: Proc 34Th SoCG 2018 (2018) · Zbl 1489.68347
[15] Charikar, M.; Lehman, E.; Liu, D.; Panigrahy, R.; Prabhakaran, M.; Sahai, A.; Shelat, A., The smallest grammar problem, IEEE Trans. Inform. Theory, 51, 7, 2554-2576 (2005) · Zbl 1296.68086 · doi:10.1109/TIT.2005.850116
[16] Chazelle, B., Filtering search: A new approach to query-answering, SIAM J. Comput., 15, 3, 703-724 (1986) · Zbl 0612.68088 · doi:10.1137/0215051
[17] Chern, B., Ochoa, I., Manolakos, A., No, A., Venkat, K., Weissman, T.: Reference based genome compression. In: Proc. 12Th ITW, pp. 427-431 (2012)
[18] De Berg, M.; Vankreveld, M.; Snoeyink, J., Two-dimensional and three-dimensional point location in rectangular subdivisions, J. Algorithms, 18, 2, 256-277 (1995) · Zbl 0939.68936 · doi:10.1006/jagm.1995.1010
[19] Dietz, P.F.: Fully persistent arrays (extended array). In: Proceedings of the Workshop on Algorithms and Data Structures, Lecture Notes in Computer Science, vol. 382, pp. 67-74 (1989) · Zbl 0767.68017
[20] Dietz, P.F.: Optimal algorithms for list indexing and subset rank. In: Proc. 1St WADS, pp. 39-46 (1989) · Zbl 0766.68057
[21] Do, HH; Jansson, J.; Sadakane, K.; Sung, WK, Fast relative lempel-Ziv self-index for similar sequences, Theoret. Comput. Sci., 532, 14-30 (2014) · Zbl 1359.68334 · doi:10.1016/j.tcs.2013.07.024
[22] Driscoll, J.; Sarnak, N.; Sleator, D.; Tarjan, R., Making data structures persistent, J. Comput. System Sci., 38, 86-124 (1989) · Zbl 0667.68026 · doi:10.1016/0022-0000(89)90034-2
[23] Fenwick, PM, A new data structure for cumulative frequency tables, Software: Pract. Exper., 24, 3, 327-336 (1994)
[24] Ferragina, P.; Manzini, G.; Mäkinen, V.; Navarro, G., Compressed representations of sequences and full-text indexes, ACM Trans. Algorithms, 3, 2, 20 (2007) · Zbl 1321.68263 · doi:10.1145/1240233.1240243
[25] Ferragina, P.; Venturini, R., A simple storage scheme for strings achieving entropy bounds, Theoret. Comput. Sci., 372, 1, 115-121 (2007) · Zbl 1110.68029 · doi:10.1016/j.tcs.2006.12.012
[26] Fredman, M., Saks, M.: The cell probe complexity of dynamic data structures. In: Proc. 21St STOC, pp. 345-354 (1989)
[27] Fredman, ML; Willard, DE, Surpassing the information theoretic bound with fusion trees, J. Comput. System Sci., 47, 3, 424-436 (1993) · Zbl 0795.68049 · doi:10.1016/0022-0000(93)90040-4
[28] Fredman, ML; Willard, DE, Trans-dichotomous algorithms for minimum spanning trees and shortest paths, J. Comput. System Sci., 48, 3, 533-551 (1994) · Zbl 0806.68057 · doi:10.1016/S0022-0000(05)80064-9
[29] Gagie, T., Gawrychowski, P., Kärkkäinen, J., Nekrich, Y., Puglisi, S.J.: A faster grammar-based self-index. In: Proc. 6Th LATA, pp. 240-251 (2012) · Zbl 1351.68089
[30] Gagie, T., Gawrychowski, P., Kärkkäinen, J., Nekrich, Y., Puglisi, S.J.: LZ77-based self-indexing with faster pattern matching. In: Proc. 14Th LATIN, pp. 731-742 (2014) · Zbl 1405.68104
[31] Gagie, T.; Gawrychowski, P.; Puglisi, SJ, Approximate pattern matching in lz77-compressed texts, J. Discrete Algorithms, 32, 64-68 (2015) · Zbl 1328.68327 · doi:10.1016/j.jda.2014.10.003
[32] Gagie, T., Karhu, K., Navarro, G., Puglisi, S.J., Sirén, J.: Document listing on repetitive collections. In: Proc. 24Th CPM, pp. 107-119 (2013) · Zbl 1381.68076
[33] Ganardi, M., Jez, A., Lohrey, M.: Balancing straight-line programs. In: Proc. 60Th FOCS, pp. 1169-1183 (2019) · Zbl 1499.68162
[34] Golynski, A., Munro, J.I., Rao, S.S.: Rank/Select operations on large alphabets: a tool for text indexing. In: Proc. 17Th SODA, pp. 368-373 (2006) · Zbl 1192.68800
[35] Golynski, A., Raman, R., Rao, S.S.: On the redundancy of succinct data structures. In: Proc. 11Th SWAT, pp. 148-159 (2008) · Zbl 1155.68374
[36] Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proc. 14Th SODA, pp. 841-850 (2003) · Zbl 1092.68584
[37] Grossi, R., Raman, R., Rao, S.S., Venturini, R.: Dynamic compressed strings with random access. In: Proc. 40Th ICALP, pp. 504-515 (2013) · Zbl 1336.68063
[38] Hon, WK; Sadakane, K.; Sung, WK, Succinct data structures for searchable partial sums with optimal worst-case performance, Theoret. Comput. Sci., 412, 39, 5176-5186 (2011) · Zbl 1226.68032 · doi:10.1016/j.tcs.2011.05.023
[39] Hoobin, C.; Puglisi, SJ; Zobel, J., Relative lempel-Ziv factorization for efficient storage and retrieval of web collections, Proc. VLDB Endowment, 5, 3, 265-273 (2011) · doi:10.14778/2078331.2078341
[40] Jørgensen, A.G., Larsen, K.G.: Range selection and median: tight cell probe lower bounds and adaptive data structures. In: Proc. 22Nd SODA, pp. 805-813 (2011) · Zbl 1373.68196
[41] Kempa, D., Prezza, N.: At the roots of dictionary compression: string attractors. In: Proc. 50Th STOC, pp. 827-840 (2018) · Zbl 1418.68085
[42] Kuruppu, S., Puglisi, S.J., Zobel, J.: Relative lempel-ziv compression of genomes for large-scale storage and retrieval. In: Proc. 17Th SPIRE, pp. 201-206 (2010) · Zbl 1397.68073
[43] Kuruppu, S., Puglisi, S.J., Zobel, J.: Optimized relative lempel-ziv compression of genomes. In: Proc. 34Th ACSC, pp. 91-98 (2011) · Zbl 1397.68073
[44] Liao, SY; Devadas, S.; Keutzer, K., A text-compression-based method for code size minimization in embedded systems, Trans. Design Autom. Electr. Syst., 4, 1, 12-38 (1999) · doi:10.1145/298865.298867
[45] Liao, SY; Devadas, S.; Keutzer, K.; Tjiang, SWK; Wang, A., Code optimization techniques in embedded DSP microprocessors, Design. Autom. Emb. Sys., 3, 1, 59-73 (1998) · doi:10.1023/A:1008803430710
[46] Mäkinen, V.; Navarro, G.; Sirén, J.; Välimäki, N., Storage and retrieval of highly repetitive sequence collections, J. Comput. Biol., 17, 3, 281-308 (2010) · doi:10.1089/cmb.2009.0169
[47] Munro, J.I., Nekrich, Y.: Compressed data structures for dynamic sequences. In: Proc. 23Rd ESA, pp. 891-902 (2015) · Zbl 1466.68034
[48] Navarro, G.: Indexing highly repetitive collections. In: Proc. 23Rd IWOCA, pp. 274-279 (2012) · Zbl 1293.68087
[49] Navarro, G., Document listing on repetitive collections with guaranteed performance, Theoret. Comput. Sci., 772, 58-72 (2019) · Zbl 1422.68062 · doi:10.1016/j.tcs.2018.11.022
[50] Nekrich, Y.: A dynamic stabbing-max data structure with sub-logarithmic query time. In: Proc. 22Nd ISAAC, pp. 170-179 (2011) · Zbl 1350.68078
[51] Pătraşcu, M., Thorup, M.: Dynamic integer sets with optimal rank, select, and predecessor search. In: Proc. 55Th FOCS, pp. 166-175 (2014)
[52] Pǎtraşcu, M.; Demaine, ED, Logarithmic lower bounds in the cell-probe model, SIAM J. Comput., 35, 4, 932-963 (2006) · Zbl 1122.68044 · doi:10.1137/S0097539705447256
[53] Raman, R., Raman, V., Rao, S.S.: Succinct dynamic data structures. In: Proc. 7Th WADS, pp. 426-437 (2001) · Zbl 0997.68520
[54] Rytter, W., Application of Lempel-Ziv factorization to the approximation of grammar-based compression, Theoret. Comput. Sci., 302, 1-3, 211-222 (2003) · Zbl 1051.68088 · doi:10.1016/S0304-3975(02)00777-6
[55] Sadakane, K., Grossi, R.: Squeezing succinct data structures into entropy bounds. In: Proc. 17Th SODA, pp. 1230-1239 (2006) · Zbl 1192.68188
[56] Sarnak, N.; Tarjan, RE, Planar point location using persistent search trees, Commun. ACM, 29, 7, 669-679 (1986) · Zbl 0732.68102 · doi:10.1145/6138.6151
[57] Storer, J.A., Szymanski, T.G.: The macro model for data compression. In: Proc. 10Th STOC, pp. 30-39 (1978) · Zbl 1282.68097
[58] Storer, JA; Szymanski, TG, Data compression via textual substitution, J. ACM, 29, 4, 928-951 (1982) · Zbl 0489.68041 · doi:10.1145/322344.322346
[59] Tarjan, R.E., Vishkin, U.: Finding biconnected componemts and computing tree functions in logarithmic parallel time. In: Proc. 25Th FOCS, pp. 12-20 (1984)
[60] Verbin, E., Yu, W.: Data structure lower bounds on random access to grammar-compressed strings. In: Proc. 24Th CPM, pp. 247-258 (2013) · Zbl 1381.68073
[61] Ziv, J.; Lempel, A., A universal algorithm for sequential data compression, IEEE Trans. Inform. Theory, 23, 3, 337-343 (1977) · Zbl 0379.94010 · doi:10.1109/TIT.1977.1055714
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.