{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T04:54:14Z","timestamp":1750308854393,"version":"3.41.0"},"reference-count":34,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2010,10,12]],"date-time":"2010-10-12T00:00:00Z","timestamp":1286841600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Database Syst."],"published-print":{"date-parts":[[2010,11]]},"abstract":"<jats:p>Over the past three decades, the suffix tree has served as a fundamental data structure in string processing. However, its widespread applicability has been hindered due to the fact that suffix tree construction does not scale well with the size of the input string. With advances in data collection and storage technologies, large strings have become ubiquitous, especially across emerging applications involving text, time series, and biological sequence data. To benefit from these advances, it is imperative that we have a scalable suffix tree construction algorithm.<\/jats:p>\n          <jats:p>The past few years have seen the emergence of several disk-based suffix tree construction algorithms. However, construction times continue to be daunting\u2014for example, indexing the entire human genome still takes over 30 hours on a system with 2 gigabytes of physical memory. In this article, we will empirically demonstrate and argue that all existing suffix tree construction algorithms have a severe limitation\u2014to glean reasonable disk I\/O efficiency, the input string being indexed must fit in main memory. This limitation is attributed to the poor locality exhibited by existing suffix tree construction algorithms and inhibits both sequential and parallel scalability. To deal with this limitation, we will show that through careful algorithm design, one of the simplest suffix tree construction algorithms can be rearchitected to build a suffix tree in a tiled manner, allowing the execution to operate within a fixed main memory budget when indexing strings of any size. We will also present a parallel extension of our algorithm that is designed for massively parallel systems like the IBM Blue Gene. An experimental evaluation will show that the proposed approach affords an improvement of several orders of magnitude in serial performance when indexing large strings. Furthermore, the proposed parallel extension is shown to be scalable\u2014it is now possible to index the entire human genome on a 1024 processor IBM Blue Gene system in under 15 minutes.<\/jats:p>","DOI":"10.1145\/1862919.1862922","type":"journal-article","created":{"date-parts":[[2010,12,20]],"date-time":"2010-12-20T15:55:04Z","timestamp":1292860504000},"page":"1-37","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["I\/O efficient algorithms for serial and parallel suffix tree construction"],"prefix":"10.1145","volume":"35","author":[{"given":"Amol","family":"Ghoting","sequence":"first","affiliation":[{"name":"IBM Thomas J. Watson Research Center, NY"}]},{"given":"Konstantin","family":"Makarychev","sequence":"additional","affiliation":[{"name":"IBM Thomas J. Watson Research Center, NY"}]}],"member":"320","published-online":{"date-parts":[[2010,10,12]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF01762122"},{"volume-title":"Proceedings of the IEEE International Conference on Data Engineering.","author":"Bedathur S.","key":"e_1_2_1_2_1"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1007\/11602569_8"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1101\/gr.789803"},{"volume-title":"Proceedings of the Asia-Pacific Bioinformatics Conference.","year":"2004","author":"Brown A.","key":"e_1_2_1_5_1"},{"key":"e_1_2_1_6_1","unstructured":"Burrows M. and Wheeler D. 1994. A block sorting lossless data compression algorithm. Tech. rep. Digital Equipment Corporation.  Burrows M. and Wheeler D. 1994. A block sorting lossless data compression algorithm. Tech. rep. Digital Equipment Corporation."},{"volume-title":"Proceedings of the 11th Conference on String Processing and Information Retrieval.","author":"Carvalho A.","key":"e_1_2_1_7_1"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF01185431"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2005.3"},{"volume-title":"Proceedings of the Annual Symposium on Combinatorial Pattern Matching.","author":"Clifford R.","key":"e_1_2_1_10_1"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1093\/nar\/27.11.2369"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1093\/nar\/30.11.2478"},{"volume-title":"Proceedings of the Annual Symposium on Foundations of Computer Science.","author":"Farach-Colton M.","key":"e_1_2_1_13_1"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/355541.355547"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-12200-2_60"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/1559845.1559931"},{"key":"e_1_2_1_17_1","doi-asserted-by":"crossref","unstructured":"Gusfield D. 1997. Algorithms on Strings Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press Cambridge UK.   Gusfield D. 1997. Algorithms on Strings Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press Cambridge UK.","DOI":"10.1017\/CBO9780511574931"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcss.2004.03.004"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/195058.195162"},{"volume-title":"Proceedings of 27th International Conference on Very Large Databases.","author":"Hunt E.","key":"e_1_2_1_20_1"},{"volume-title":"Proceedings of the Bioinformatics Workshop at the 21st Annual British National Conference on Databases.","year":"2004","author":"Japp R.","key":"e_1_2_1_21_1"},{"volume-title":"Reputer: The manifold applications of repeat analysis on a genome scale. Nucleic Acids Res. 29.","year":"2001","author":"Kurtz S.","key":"e_1_2_1_22_1"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1186\/gb-2004-5-2-r12"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/321941.321946"},{"volume-title":"Proceedings of 29th International Conference on Very Large Databases.","author":"Meek C.","key":"e_1_2_1_25_1"},{"key":"e_1_2_1_26_1","unstructured":"NCBI. Public collections of DNA and RNA sequence reach 100 gigabases. http:\/\/www.nlm.nih.gov\/news\/press_releases\/dna_rna_100_gig.html.  NCBI. Public collections of DNA and RNA sequence reach 100 gigabases. http:\/\/www.nlm.nih.gov\/news\/press_releases\/dna_rna_100_gig.html."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/1247480.1247572"},{"volume-title":"Proceedings of the Pacific Symposium on Biocomputing.","author":"Phoophakdee B.","key":"e_1_2_1_28_1"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1126\/science.287.5461.2204"},{"key":"e_1_2_1_30_1","unstructured":"Schurmann K. and Stoye J. 2003. Suffix tree construction and storage with limited main memory. Tech. rep. Universitat Bielefeld.  Schurmann K. and Stoye J. 2003. Suffix tree construction and storage with limited main memory. Tech. rep. Universitat Bielefeld."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-005-0154-8"},{"volume-title":"Proceedings of the IFIP 12th Work Computer Congress on Algorithms","year":"1992","author":"Ukkonen E.","key":"e_1_2_1_32_1"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/SWAT.1973.13"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/290941.290956"}],"container-title":["ACM Transactions on Database Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1862919.1862922","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/1862919.1862922","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T21:14:51Z","timestamp":1750281291000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1862919.1862922"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2010,10,12]]},"references-count":34,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2010,11]]}},"alternative-id":["10.1145\/1862919.1862922"],"URL":"https:\/\/doi.org\/10.1145\/1862919.1862922","relation":{},"ISSN":["0362-5915","1557-4644"],"issn-type":[{"type":"print","value":"0362-5915"},{"type":"electronic","value":"1557-4644"}],"subject":[],"published":{"date-parts":[[2010,10,12]]},"assertion":[{"value":"2009-10-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2010-08-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2010-10-12","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}