{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,3,25]],"date-time":"2025-03-25T17:39:53Z","timestamp":1742924393547,"version":"3.40.3"},"publisher-location":"Cham","reference-count":30,"publisher":"Springer Nature Switzerland","isbn-type":[{"type":"print","value":"9783031215339"},{"type":"electronic","value":"9783031215346"}],"license":[{"start":{"date-parts":[[2022,1,1]],"date-time":"2022-01-01T00:00:00Z","timestamp":1640995200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,1,18]],"date-time":"2023-01-18T00:00:00Z","timestamp":1674000000000},"content-version":"vor","delay-in-days":382,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2022]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>De novo genome assembly is a fundamental task in life sciences. It is mostly a typical big data problem with sometimes billions of reads, a big puzzle in which the genome is hidden. Memory and time efficient algorithms are sought, preferably to run even on desktops in labs. In this chapter we address some algorithmic problems related to genome assembly. We first present an algorithm which heavily reduces the size of input data, but with no essential compromize on the assembly quality. In such and many other algorithms in bioinformatics the counting of k-mers is a botleneck. We discuss counting in external memory. The construction of large parts of the genome, called contigs, can be modelled as the longest path problem or the Euler tour problem in some graphs build on reads or k-mers. We present a linear time streaming algorithm for constructing long paths in undirected graphs, and a streaming algorithm for the Euler tour problem with optimal one-pass complexity.<\/jats:p>","DOI":"10.1007\/978-3-031-21534-6_13","type":"book-chapter","created":{"date-parts":[[2023,1,17]],"date-time":"2023-01-17T20:02:53Z","timestamp":1673985773000},"page":"229-251","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Algorithms for\u00a0Big Data Problems in\u00a0de Novo Genome Assembly"],"prefix":"10.1007","author":[{"given":"Anand","family":"Srivastav","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Axel","family":"Wedemeyer","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Christian","family":"Schielke","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jan","family":"Schiemann","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2023,1,18]]},"reference":[{"key":"13_CR1","unstructured":"Brown, C.T., Howe, A., Zhang, Q., Pyrkosz, A.B., Brom, T.H.: A reference-free algorithm for computational normalization of shotgun sequencing data, pp. 1\u201318. ArXiv e-prints (2012). https:\/\/arxiv.org\/abs\/1203.4802"},{"key":"13_CR2","doi-asserted-by":"publisher","unstructured":"Bulterman, R.W., van\u00a0der Sommen, F.W., Zwaan, G., Verhoeff, T., van Gasteren, A.J.M., Feijen, W.H.J.: On computing a longest path in a tree. Inf. Process. Lett. 81(2), 93\u201396 (2002). https:\/\/doi.org\/10.1016\/S0020-0190(01)00198-3","DOI":"10.1016\/S0020-0190(01)00198-3"},{"issue":"11","key":"13_CR3","doi-asserted-by":"publisher","first-page":"987","DOI":"10.1038\/nbt.2023","volume":"29","author":"EC Compeau Phillip","year":"2011","unstructured":"Compeau Phillip, E.C., Pevzner Pavel, A., Tesler, G.: How to apply de Bruijn graphs to genome assembly. Nat. Biotechnol. 29(11), 987\u2013991 (2011). https:\/\/doi.org\/10.1038\/nbt.2023","journal-title":"Nat. Biotechnol."},{"issue":"1","key":"13_CR4","doi-asserted-by":"publisher","first-page":"58","DOI":"10.1016\/j.jalgor.2003.12.001","volume":"55","author":"G Cormode","year":"2005","unstructured":"Cormode, G., Muthukrishnan, S.: An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55(1), 58\u201375 (2005). https:\/\/doi.org\/10.1016\/j.jalgor.2003.12.001","journal-title":"J. Algorithms"},{"issue":"6","key":"13_CR5","doi-asserted-by":"publisher","first-page":"589","DOI":"10.1002\/spe.844","volume":"38","author":"R Dementiev","year":"2008","unstructured":"Dementiev, R., Kettner, L., Sanders, P.: STXXL: standard template library for XXL data sets. Softw. Pract. Exp. 38(6), 589\u2013637 (2008). https:\/\/doi.org\/10.1002\/spe.844","journal-title":"Softw. Pract. Exp."},{"issue":"44\u201346","key":"13_CR6","doi-asserted-by":"publisher","first-page":"3994","DOI":"10.1016\/j.tcs.2010.08.030","volume":"411","author":"C Demetrescu","year":"2010","unstructured":"Demetrescu, C., Escoffier, B., Moruz, G., Ribichini, A.: Adapting parallel algorithms to the w-stream model, with applications to graph problems. Theor. Comput. Sci. 411(44\u201346), 3994\u20134004 (2010). https:\/\/doi.org\/10.1016\/j.tcs.2010.08.030","journal-title":"Theor. Comput. Sci."},{"issue":"1","key":"13_CR7","doi-asserted-by":"publisher","first-page":"6:1","DOI":"10.1145\/1644015.1644021","volume":"6","author":"C Demetrescu","year":"2009","unstructured":"Demetrescu, C., Finocchi, I., Ribichini, A.: Trading off space for passes in graph streaming problems. ACM Trans. Algorithms 6(1), 6:1-6:17 (2009). https:\/\/doi.org\/10.1145\/1644015.1644021","journal-title":"ACM Trans. Algorithms"},{"key":"13_CR8","doi-asserted-by":"publisher","first-page":"160","DOI":"10.1186\/1471-2105-14-160","volume":"14","author":"S Deorowicz","year":"2013","unstructured":"Deorowicz, S., Debudaj-Grabysz, A., Grabowski, S.: Disk-based k-mer counting on a PC. BMC Bioinform. 14, 160 (2013). https:\/\/doi.org\/10.1186\/1471-2105-14-160","journal-title":"BMC Bioinform."},{"issue":"10","key":"13_CR9","doi-asserted-by":"publisher","first-page":"1569","DOI":"10.1093\/bioinformatics\/btv022","volume":"31","author":"S Deorowicz","year":"2015","unstructured":"Deorowicz, S., Kokot, M., Grabowski, S., Debudaj-Grabysz, A.: KMC 2: fast and resource-frugal k-mer counting. Bioinformatics 31(10), 1569\u20131576 (2015). https:\/\/doi.org\/10.1093\/bioinformatics\/btv022","journal-title":"Bioinformatics"},{"issue":"1","key":"13_CR10","doi-asserted-by":"publisher","first-page":"19","DOI":"10.1006\/jagm.1997.0873","volume":"25","author":"M Dietzfelbinger","year":"1997","unstructured":"Dietzfelbinger, M., Hagerup, T., Katajainen, J., Penttonen, M.: A reliable randomized algorithm for the closest-pair problem. J. Algorithms 25(1), 19\u201351 (1997). https:\/\/doi.org\/10.1006\/jagm.1997.0873","journal-title":"J. Algorithms"},{"key":"13_CR11","series-title":"Lecture Notes in Computer Science","doi-asserted-by":"publisher","first-page":"531","DOI":"10.1007\/978-3-540-27836-8_46","volume-title":"Automata, Languages and Programming","author":"J Feigenbaum","year":"2004","unstructured":"Feigenbaum, J., Kannan, S., McGregor, A., Suri, S., Zhang, J.: On graph problems in a semi-streaming model. In: D\u00edaz, J., Karhum\u00e4ki, J., Lepist\u00f6, A., Sannella, D. (eds.) ICALP 2004. LNCS, vol. 3142, pp. 531\u2013543. Springer, Heidelberg (2004). https:\/\/doi.org\/10.1007\/978-3-540-27836-8_46"},{"key":"13_CR12","doi-asserted-by":"publisher","unstructured":"Gao, T., et al.: Bloomfish: a highly scalable distributed k-mer counting framework. In: IEEE ICPADS 2017, pp. 170\u2013179. IEEE Computer Society (2017). https:\/\/doi.org\/10.1109\/ICPADS.2017.00033","DOI":"10.1109\/ICPADS.2017.00033"},{"key":"13_CR13","doi-asserted-by":"publisher","unstructured":"Glazik, C., Schiemann, J., Srivastav, A.: Finding Euler tours in one pass in the W-streaming model with O(n log(n)) RAM. CoRR abs\/1710.04091 (2017). Theory of Computing Systems 2022, 23 p. Springer. https:\/\/doi.org\/10.1007\/s00224-022-10077-w","DOI":"10.1007\/s00224-022-10077-w"},{"issue":"10","key":"13_CR14","doi-asserted-by":"publisher","first-page":"R245","DOI":"10.1016\/s1074-5521(98)90108-9","volume":"5","author":"J Handelsman","year":"1998","unstructured":"Handelsman, J., Rondon, M.R., Brady, S.F., Clardy, J., Goodman, R.M.: Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem. Biol. 5(10), R245-9 (1998). https:\/\/doi.org\/10.1016\/s1074-5521(98)90108-9","journal-title":"Chem. Biol."},{"issue":"11","key":"13_CR15","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/gb-2010-11-11-r116","volume":"11","author":"DR Kelley","year":"2010","unstructured":"Kelley, D.R., Schatz, M.C., Salzberg, S.L.: Quake: quality-aware detection and correction of sequencing errors. Genome Biol. 11(11), 1\u201313 (2010). https:\/\/doi.org\/10.1186\/gb-2010-11-11-r116","journal-title":"Genome Biol."},{"key":"13_CR16","doi-asserted-by":"publisher","unstructured":"Kliemann, L., Schielke, C., Srivastav, A.: A streaming algorithm for the undirected longest path problem. In: ESA, pp. 56:1\u201356:17. Schloss Dagstuhl - Leibniz-Zentrum f\u00fcr Informatik (2016). https:\/\/doi.org\/10.4230\/LIPIcs.ESA.2016.56","DOI":"10.4230\/LIPIcs.ESA.2016.56"},{"issue":"17","key":"13_CR17","doi-asserted-by":"publisher","first-page":"2759","DOI":"10.1093\/bioinformatics\/btx304","volume":"33","author":"M Kokot","year":"2017","unstructured":"Kokot, M., Dlugosz, M., Deorowicz, S.: KMC 3: counting and manipulating k-mer statistics. Bioinformatics 33(17), 2759\u20132761 (2017). https:\/\/doi.org\/10.1093\/bioinformatics\/btx304","journal-title":"Bioinformatics"},{"key":"13_CR18","unstructured":"von Looz, M., Staudt, C.L., Meyerhenke, H., Prutkin, R.: Fast generation of dynamic complex networks with underlying hyperbolic geometry. CoRR abs\/1501.03545 (2015). http:\/\/arxiv.org\/abs\/1501.03545"},{"issue":"6","key":"13_CR19","doi-asserted-by":"publisher","first-page":"764","DOI":"10.1093\/bioinformatics\/btr011","volume":"27","author":"G Mar\u00e7ais","year":"2011","unstructured":"Mar\u00e7ais, G., Kingsford, C.: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27(6), 764\u2013770 (2011). https:\/\/doi.org\/10.1093\/bioinformatics\/btr011","journal-title":"Bioinformatics"},{"key":"13_CR20","unstructured":"Nehls, C.: Effizientes sortier-basiertes Z\u00e4hlen von k-meren im externen Speicher. Mathematisches Seminar, Universit\u00e4t zu Kiel, Masterarbeit (2018)"},{"issue":"4","key":"13_CR21","doi-asserted-by":"publisher","first-page":"568","DOI":"10.1093\/bioinformatics\/btx636","volume":"34","author":"P Pandey","year":"2018","unstructured":"Pandey, P., Bender, M.A., Johnson, R., Patro, R.: Squeakr: an exact and approximate k-mer counting system. Bioinformatics 34(4), 568\u2013575 (2018). https:\/\/doi.org\/10.1093\/bioinformatics\/btx636","journal-title":"Bioinformatics"},{"key":"13_CR22","doi-asserted-by":"publisher","unstructured":"Pevzner, P.A., Tang, H., Waterman, M.S.: A new approach to fragment assembly in DNA sequencing. In: RECOMB, pp. 256\u2013267. ACM (2001). https:\/\/doi.org\/10.1145\/369133.369230","DOI":"10.1145\/369133.369230"},{"issue":"7","key":"13_CR23","doi-asserted-by":"publisher","first-page":"446","DOI":"10.1145\/363427.363463","volume":"10","author":"I Pohl","year":"1967","unstructured":"Pohl, I.: A method for finding Hamilton paths and knight\u2019s tours. Commun. ACM 10(7), 446\u2013449 (1967). https:\/\/doi.org\/10.1145\/363427.363463","journal-title":"Commun. ACM"},{"key":"13_CR24","unstructured":"Pohl, I., Stockmeyer, L.: Pohl-Warnsdorf - revisited. In: Proceedings of the ISC 2004 (2004). https:\/\/users.soe.ucsc.edu\/~pohl\/Papers\/Pohl_Stockmeyer_full.pdf"},{"key":"13_CR25","unstructured":"Pongr\u00e1cz, L.L.: A greedy approximation algorithm for the longest path problem in undirected graphs. CoRR abs\/1209.2503 (2012). withdrawn"},{"issue":"5","key":"13_CR26","doi-asserted-by":"publisher","first-page":"652","DOI":"10.1093\/bioinformatics\/btt020","volume":"29","author":"G Rizk","year":"2013","unstructured":"Rizk, G., Lavenier, D., Chikhi, R.: DSK: k-mer counting with very low memory usage. Bioinformatics 29(5), 652\u2013653 (2013). https:\/\/doi.org\/10.1093\/bioinformatics\/btt020","journal-title":"Bioinformatics"},{"issue":"14","key":"13_CR27","doi-asserted-by":"publisher","first-page":"1950","DOI":"10.1093\/bioinformatics\/btu132","volume":"30","author":"RS Roy","year":"2014","unstructured":"Roy, R.S., Bhattacharya, D., Schliep, A.: Turtle: identifying frequent k-mers with cache-efficient algorithms. Bioinformatics 30(14), 1950\u20131957 (2014). https:\/\/doi.org\/10.1093\/bioinformatics\/btu132","journal-title":"Bioinformatics"},{"key":"13_CR28","doi-asserted-by":"publisher","unstructured":"Sun, X., Woodruff, D.P.: Tight bounds for graph problems in insertion streams. In: APPROX-RANDOM, pp. 435\u2013448. Schloss Dagstuhl - Leibniz-Zentrum f\u00fcr Informatik (2015). https:\/\/doi.org\/10.4230\/LIPIcs.APPROX-RANDOM.2015.435","DOI":"10.4230\/LIPIcs.APPROX-RANDOM.2015.435"},{"issue":"1","key":"13_CR29","doi-asserted-by":"publisher","first-page":"324","DOI":"10.1186\/s12859-017-1724-7","volume":"18","author":"A Wedemeyer","year":"2017","unstructured":"Wedemeyer, A., Kliemann, L., Srivastav, A., Schielke, C., Reusch, T.B., Rosenstiel, P.: An improved filtering algorithm for big read datasets and its application to single-cell assembly. BMC Bioinform. 18(1), 324 (2017). https:\/\/doi.org\/10.1186\/s12859-017-1724-7","journal-title":"BMC Bioinform."},{"key":"13_CR30","unstructured":"W\u00f6lfel, P.: \u00dcber die Komplexit\u00e4t der Multiplikation in eingeschr\u00e4nkten Branchingprogrammmodellen. Ph.D. thesis, Technical University of Dortmund, Germany (2003). http:\/\/hdl.handle.net\/2003\/2539"}],"container-title":["Lecture Notes in Computer Science","Algorithms for Big Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/978-3-031-21534-6_13","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,17]],"date-time":"2023-01-17T20:04:43Z","timestamp":1673985883000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/978-3-031-21534-6_13"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022]]},"ISBN":["9783031215339","9783031215346"],"references-count":30,"URL":"https:\/\/doi.org\/10.1007\/978-3-031-21534-6_13","relation":{},"ISSN":["0302-9743","1611-3349"],"issn-type":[{"type":"print","value":"0302-9743"},{"type":"electronic","value":"1611-3349"}],"subject":[],"published":{"date-parts":[[2022]]},"assertion":[{"value":"18 January 2023","order":1,"name":"first_online","label":"First Online","group":{"name":"ChapterHistory","label":"Chapter History"}}]}}