{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,26]],"date-time":"2025-09-26T13:09:02Z","timestamp":1758892142409,"version":"3.37.3"},"reference-count":35,"publisher":"Springer Science and Business Media LLC","issue":"6","license":[{"start":{"date-parts":[[2020,7,20]],"date-time":"2020-07-20T00:00:00Z","timestamp":1595203200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2020,7,20]],"date-time":"2020-07-20T00:00:00Z","timestamp":1595203200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100007826","name":"Technische Universit\u00e4t Ilmenau","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100007826","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["The VLDB Journal"],"published-print":{"date-parts":[[2020,11]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>String dictionaries constitute a large portion of the memory footprint of database applications. While strong string dictionary compression algorithms exist, these come with impractical access and compression times. Therefore, lightweight algorithms such as front coding (PFC) are favored in practice. This paper endeavors to make strong string dictionary compression practical. We focus on Re-Pair Front Coding (RPFC), a grammar-based compression algorithm, since it consistently offers better compression ratios than other algorithms in the literature. To accelerate compression times, we propose block-based RPFC (BRPFC) which consists in independently compressing small blocks of the dictionary. For further accelerated compression times especially on large string dictionaries, we also propose an alternative version of BRPFC that uses sampling to speed up compression. Moreover, to accelerate access times, we devise a vectorized access method, using <jats:inline-formula><jats:alternatives><jats:tex-math>$$\\hbox {Intel}^{\\circledR }$$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                  <mml:msup>\n                    <mml:mtext>Intel<\/mml:mtext>\n                    <mml:mo>\u00ae<\/mml:mo>\n                  <\/mml:msup>\n                <\/mml:math><\/jats:alternatives><\/jats:inline-formula>\u00a0Advanced Vector Extensions 512 (<jats:inline-formula><jats:alternatives><jats:tex-math>$$\\hbox {Intel}^{\\circledR }$$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                  <mml:msup>\n                    <mml:mtext>Intel<\/mml:mtext>\n                    <mml:mo>\u00ae<\/mml:mo>\n                  <\/mml:msup>\n                <\/mml:math><\/jats:alternatives><\/jats:inline-formula>\u00a0AVX-512). Our experimental evaluation shows that sampled BRPFC offers compression times up to 190\u00a0<jats:inline-formula><jats:alternatives><jats:tex-math>$$\\times $$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                  <mml:mo>\u00d7<\/mml:mo>\n                <\/mml:math><\/jats:alternatives><\/jats:inline-formula> faster than RPFC, and random string lookups 2.3\u00a0<jats:inline-formula><jats:alternatives><jats:tex-math>$$\\times $$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                  <mml:mo>\u00d7<\/mml:mo>\n                <\/mml:math><\/jats:alternatives><\/jats:inline-formula> faster than RPFC on average. These results move our modified RPFC into a practical range for use in database systems because the overhead of Re-Pair-based compression for access times can be reduced by 2\u00a0<jats:inline-formula><jats:alternatives><jats:tex-math>$$\\times $$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                  <mml:mo>\u00d7<\/mml:mo>\n                <\/mml:math><\/jats:alternatives><\/jats:inline-formula>.<\/jats:p>","DOI":"10.1007\/s00778-020-00620-x","type":"journal-article","created":{"date-parts":[[2020,7,20]],"date-time":"2020-07-20T12:04:21Z","timestamp":1595246661000},"page":"1263-1285","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["Faster &amp; strong: string dictionary compression using sampling and fast vectorized decompression"],"prefix":"10.1007","volume":"29","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6491-001X","authenticated-orcid":false,"given":"Robert","family":"Lasch","sequence":"first","affiliation":[]},{"given":"Ismail","family":"Oukid","sequence":"additional","affiliation":[]},{"given":"Roman","family":"Dementiev","sequence":"additional","affiliation":[]},{"given":"Norman","family":"May","sequence":"additional","affiliation":[]},{"given":"Suleyman S.","family":"Demirsoy","sequence":"additional","affiliation":[]},{"given":"Kai-Uwe","family":"Sattler","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2020,7,20]]},"reference":[{"doi-asserted-by":"crossref","unstructured":"Abadi, D., Madden, S., Ferreira, M.: Integrating compression and execution in column-oriented database systems. In: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pp. 671\u2013682 (2006)","key":"620_CR1","DOI":"10.1145\/1142473.1142548"},{"doi-asserted-by":"crossref","unstructured":"Lemke C., Sattler KU., Faerber F., Zeier A.: Speeding Up Queries in Column Stores. In: Bach Pedersen T., Mohania M.K., Tjoa A.M. (eds.) Data Warehousing and Knowledge Discovery. DaWaK 2010. Lecture Notes in Computer Science, vol 6263, pp 117\u2013129. Springer, Berlin, Heidelberg (2010)","key":"620_CR2","DOI":"10.1007\/978-3-642-15105-7_10"},{"issue":"3","key":"620_CR3","doi-asserted-by":"publisher","first-page":"55","DOI":"10.1145\/362084.362137","volume":"29","author":"T Westmann","year":"2000","unstructured":"Westmann, T., Kossmann, D., Helmer, S., Moerkotte, G.: The implementation and performance of compressed databases. ACM Sigmod Rec. 29(3), 55\u201367 (2000)","journal-title":"ACM Sigmod Rec."},{"issue":"1","key":"620_CR4","first-page":"28","volume":"35","author":"F F\u00e4rber","year":"2012","unstructured":"F\u00e4rber, F., May, N., Lehner, W., Gro\u00dfe, P., M\u00fcller, I., Rauhe, H., Dees, J.: The SAP HANA database\u2014an architecture overview. IEEE Data Eng. Bull. 35(1), 28\u201333 (2012)","journal-title":"IEEE Data Eng. Bull."},{"doi-asserted-by":"crossref","unstructured":"Arz, J., Fischer, J.: LZ-compressed string dictionaries. In: 2014 Data Compression Conference, pp. 322\u2013331 (2014)","key":"620_CR5","DOI":"10.1109\/DCC.2014.36"},{"key":"620_CR6","first-page":"3","volume":"19","author":"R Grossi","year":"2015","unstructured":"Grossi, R., Ottaviano, G.: Fast compressed tries through path decompositions. J. Exp. Algorithm. (JEA) 19, 3\u20134 (2015)","journal-title":"J. Exp. Algorithm. (JEA)"},{"key":"620_CR7","doi-asserted-by":"publisher","first-page":"73","DOI":"10.1016\/j.is.2015.08.008","volume":"56","author":"MA Mart\u00ednez-Prieto","year":"2016","unstructured":"Mart\u00ednez-Prieto, M.A., Brisaboa, N., C\u00e1novas, R., Claude, F., Navarro, G.: Practical compressed string dictionaries. Inf. Syst. 56, 73\u2013108 (2016)","journal-title":"Inf. Syst."},{"unstructured":"M\u00fcller, I., Ratsch, C., F\u00e4rber, F.: Adaptive string dictionary compression in in-memory column-store database systems. In: EDBT, pp. 283\u2013294 (2014)","key":"620_CR8"},{"issue":"11","key":"620_CR9","doi-asserted-by":"publisher","first-page":"1722","DOI":"10.1109\/5.892708","volume":"88","author":"NJ Larsson","year":"2000","unstructured":"Larsson, N.J., Moffat, A.: Off-line dictionary-based compression. Proc. IEEE 88(11), 1722\u20131732 (2000)","journal-title":"Proc. IEEE"},{"doi-asserted-by":"crossref","unstructured":"Brisaboa, N.R., C\u00e1novas, R., Claude, F., Mart\u00ednez-Prieto, M.A., Navarro, G.: Compressed string dictionaries. In: International Symposium on Experimental Algorithms, pp. 136\u2013147. Springer (2011)","key":"620_CR10","DOI":"10.1007\/978-3-642-20662-7_12"},{"doi-asserted-by":"crossref","unstructured":"Lasch, R., Oukid, I., Dementiev, R., May, N., Demirsoy, S.S., Sattler, K.U.: Fast & strong: the case of compressed string dictionaries on modern CPUs. In: Proceedings of the 15th International Workshop on Data Management on New Hardware, p.\u00a04. ACM (2019)","key":"620_CR11","DOI":"10.1145\/3329785.3329924"},{"issue":"4","key":"620_CR12","doi-asserted-by":"publisher","first-page":"65","DOI":"10.1145\/1498765.1498785","volume":"52","author":"S Williams","year":"2009","unstructured":"Williams, S., Waterman, A., Patterson, D.: Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52(4), 65\u201376 (2009)","journal-title":"Commun. ACM"},{"unstructured":"$$\\text{Intel}^{\\textregistered }$$\u00a0Architecture Instruction Set Extensions and Future Features Programming Reference. https:\/\/software.intel.com\/en-us\/download\/intel-architecture-instruction-set-extensions-and-future-features-programming-reference (2019). Accessed 09 Dec 2019","key":"620_CR13"},{"unstructured":"$$\\text{ Intel}^{\\textregistered }$$\u00a0Intrinsics Guide. https:\/\/software.intel.com\/sites\/landingpage\/IntrinsicsGuide\/ (2020). Accessed 08 May 2020","key":"620_CR14"},{"unstructured":"Willhalm, T., Oukid, I., M\u00fcller, I., F\u00e4rber, F.: Vectorizing database column scans with complex predicates. In: International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures\u2014ADMS 2013, pp. 1\u201312 (2013)","key":"620_CR15"},{"issue":"1","key":"620_CR16","doi-asserted-by":"publisher","first-page":"385","DOI":"10.14778\/1687627.1687671","volume":"2","author":"T Willhalm","year":"2009","unstructured":"Willhalm, T., Popovici, N., Boshmaf, Y., Plattner, H., Zeier, A., Schaffner, J.: SIMD-scan: ultra fast in-memory table scan using on-chip vector processing units. Proc. VLDB Endow. 2(1), 385\u2013394 (2009)","journal-title":"Proc. VLDB Endow."},{"doi-asserted-by":"crossref","unstructured":"Zhang, H., Liu, X., Andersen, D.G., Kaminsky, M., Keeton, K., Pavlo, A.: Order-preserving key compression for in-memory search trees. In: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (2020)","key":"620_CR17","DOI":"10.1145\/3318464.3380583"},{"unstructured":"Clark, D.: Compact pat trees. PhD thesis, University of Waterloo (1998)","key":"620_CR18"},{"doi-asserted-by":"crossref","unstructured":"Kanda, S., Morita, K., Fuketa, M.: Practical string dictionary compression using string dictionary encoding. In: 2017 International Conference on Big Data Innovations and Applications (Innovate-Data), pp. 1\u20138. IEEE (2017)","key":"620_CR19","DOI":"10.1109\/Innovate-Data.2017.9"},{"doi-asserted-by":"crossref","unstructured":"Boncz, P., Neumann, T., Leis, V.: FSST: fast random access string compression. Preprint, available at https:\/\/github.com\/cwida\/fsst\/raw\/master\/fsstcompression.pdf. Accessed 11 May 2020","key":"620_CR20","DOI":"10.14778\/3407790.3407851"},{"doi-asserted-by":"crossref","unstructured":"Yasin, A.: A top-down method for performance analysis and counters architecture. In: 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 35\u201344 (2014)","key":"620_CR21","DOI":"10.1109\/ISPASS.2014.6844459"},{"unstructured":"$$\\text{ Intel}^{\\textregistered }$$\u00a0VTune\u2122\u00a0Amplifier. https:\/\/software.intel.com\/en-us\/vtune (2019). Accessed 09 Dec 2019","key":"620_CR22"},{"key":"620_CR23","doi-asserted-by":"publisher","first-page":"35","DOI":"10.1007\/978-3-030-32686-9_3","volume-title":"String Processing and Information Retrieval","author":"TI Gagie","year":"2019","unstructured":"Gagie, T.I., Manzini, T., Navarro, G., Sakamoto, G., Takabatake, H.Y.: Rpair: rescaling repair with rsync. In: Brisaboa, N.R., Puglisi, S.J. (eds.) String Processing and Information Retrieval, pp. 35\u201344. Springer, Berlin (2019)"},{"key":"620_CR24","volume-title":"Uniform Distribution of Sequences","author":"L Kuipers","year":"2012","unstructured":"Kuipers, L., Niederreiter, H.: Uniform Distribution of Sequences. Courier Corporation, Chelmsford (2012)"},{"issue":"6","key":"620_CR25","doi-asserted-by":"publisher","first-page":"333","DOI":"10.1145\/360825.360855","volume":"18","author":"AV Aho","year":"1975","unstructured":"Aho, A.V., Corasick, M.J.: Efficient string matching: an aid to bibliographic search. Commun. ACM 18(6), 333\u2013340 (1975). https:\/\/doi.org\/10.1145\/360825.360855","journal-title":"Commun. ACM"},{"unstructured":"$$\\text{ Intel}^{\\textregistered }$$\u00a0$$\\text{ Xeon}^{\\textregistered }$$\u00a0Platinum 8180 Processor. https:\/\/ark.intel.com\/content\/www\/us\/en\/ark\/products\/120496\/intel-xeon-platinum-8180-processor-38-5m-cache-2-50-ghz.html (2017). Accessed 09 Dec 2019","key":"620_CR26"},{"unstructured":"Laboratory for Web Algorithmics\u2014Datasets. http:\/\/law.di.unimi.it\/datasets.php (2019). Accessed 09 Dec 2019","key":"620_CR27"},{"issue":"8","key":"620_CR28","doi-asserted-by":"publisher","first-page":"711","DOI":"10.1002\/spe.587","volume":"34","author":"P Boldi","year":"2004","unstructured":"Boldi, P., Codenotti, B., Santini, M., Vigna, S.: Ubicrawler: a scalable fully distributed web crawler. Softw. Pract. Exp. 34(8), 711\u2013726 (2004)","journal-title":"Softw. Pract. Exp."},{"unstructured":"Wikimedia Database Dumps. https:\/\/dumps.wikimedia.org\/ (2019). Accessed 13 Jan 2020. Downloaded: 2019-03-06","key":"620_CR29"},{"unstructured":"GeoNames dump. http:\/\/download.geonames.org\/export\/dump\/ (2019). Accessed 13 Jan 2020. Downloaded: 2019-03-06","key":"620_CR30"},{"unstructured":"Snappy, a fast compressor\/decompressor. https:\/\/github.com\/google\/snappy. Accessed 05 April 2020","key":"620_CR31"},{"unstructured":"Snzip, a compression\/decompression tool based on snappy. https:\/\/github.com\/kubo\/snzip. Accessed 05 April 2020","key":"620_CR32"},{"issue":"4","key":"620_CR33","first-page":"971","volume":"8","author":"S Yoshida","year":"2013","unstructured":"Yoshida, S., Kida, T.: A variable-length-to-fixed-length coding method using a re-pair algorithm. Inf. Media Technol. 8(4), 971\u2013977 (2013)","journal-title":"Inf. Media Technol."},{"unstructured":"Intel Corporation: Improving Real-Time Performance by Utilizing Cache Allocation Technology. https:\/\/www.intel.com\/content\/dam\/www\/public\/us\/en\/documents\/white-papers\/cache-allocation-technology-white-paper.pdf (2015). Accessed 13 Jan 2020","key":"620_CR34"},{"doi-asserted-by":"crossref","unstructured":"Moerkotte, G., DeHaan, D., May, N., Nica, A., Boehm, A.: Exploiting ordered dictionaries to efficiently construct histograms with q-error guarantees in SAP HANA. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pp. 361\u2013372. ACM (2014)","key":"620_CR35","DOI":"10.1145\/2588555.2595629"}],"container-title":["The VLDB Journal"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00778-020-00620-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s00778-020-00620-x\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00778-020-00620-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,7,19]],"date-time":"2021-07-19T23:58:15Z","timestamp":1626739095000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s00778-020-00620-x"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,7,20]]},"references-count":35,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2020,11]]}},"alternative-id":["620"],"URL":"https:\/\/doi.org\/10.1007\/s00778-020-00620-x","relation":{},"ISSN":["1066-8888","0949-877X"],"issn-type":[{"type":"print","value":"1066-8888"},{"type":"electronic","value":"0949-877X"}],"subject":[],"published":{"date-parts":[[2020,7,20]]},"assertion":[{"value":"14 January 2020","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"12 June 2020","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"16 June 2020","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"20 July 2020","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}