{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,16]],"date-time":"2026-03-16T00:23:17Z","timestamp":1773620597458,"version":"3.50.1"},"reference-count":31,"publisher":"Oxford University Press (OUP)","issue":"11","license":[{"start":{"date-parts":[[2018,10,23]],"date-time":"2018-10-23T00:00:00Z","timestamp":1540252800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"name":"NSERC Discovery Grant"},{"name":"TFRI NF PPG","award":["#1062"],"award-info":[{"award-number":["#1062"]}]},{"name":"NSERC Discovery Grant"},{"name":"NSERC CREATE Training Program in High-Dimensional"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2019,6,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>Next-Generation Sequencing has led to the availability of massive genomic datasets whose processing raises many challenges, including the handling of sequencing errors. This is especially pertinent in cancer genomics, e.g. for detecting low allele frequency variations from circulating tumor DNA. Barcode tagging of DNA molecules with unique molecular identifiers (UMI) attempts to mitigate sequencing errors; UMI tagged molecules are polymerase chain reaction (PCR) amplified, and the PCR copies of UMI tagged molecules are sequenced independently. However, the PCR and sequencing steps can generate errors in the sequenced reads that can be located in the barcode and\/or the DNA sequence. 
Analyzing UMI tagged sequencing data requires an initial clustering step, with the aim of grouping reads sequenced from PCR duplicates of the same UMI tagged molecule into a single cluster, and the size of the current datasets requires this clustering process to be resource-efficient.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>We introduce Calib, a computational tool that clusters paired-end reads from UMI tagged sequencing experiments generated by substitution-error-dominant sequencing platforms such as Illumina. Calib clusters are defined as connected components of a graph whose edges are defined in terms of both barcode similarity and read sequence similarity. The graph is constructed efficiently using locality sensitive hashing and MinHashing techniques. Calib\u2019s default clustering parameters are optimized empirically, for different UMI and read lengths, using a simulation module that is packaged with Calib. Compared to other tools, Calib has the best accuracy on simulated data, while maintaining reasonable runtime and memory footprint. On a real dataset, Calib runs with far fewer resources than alignment-based methods, and its clusters reduce the number of tentative false positives in downstream variation calling.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>Calib is implemented in C++ and its simulation module is implemented in Python. 
Calib is available at https:\/\/github.com\/vpc-ccg\/calib.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Supplementary information<\/jats:title>\n                  <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/bty888","type":"journal-article","created":{"date-parts":[[2018,10,22]],"date-time":"2018-10-22T19:42:39Z","timestamp":1540237359000},"page":"1829-1836","source":"Crossref","is-referenced-by-count":34,"title":["Alignment-free clustering of UMI tagged DNA molecules"],"prefix":"10.1093","volume":"35","author":[{"given":"Baraa","family":"Orabi","sequence":"first","affiliation":[{"name":"School of Computing Science, Faculty of Applied Sciences, Simon Fraser University, Burnaby BC, Canada"}]},{"given":"Emre","family":"Erhan","sequence":"additional","affiliation":[{"name":"School of Computing Science, Faculty of Applied Sciences, Simon Fraser University, Burnaby BC, Canada"}]},{"given":"Brian","family":"McConeghy","sequence":"additional","affiliation":[{"name":"Vancouver Prostate Centre, Vancouver BC, Canada"}]},{"given":"Stanislav V","family":"Volik","sequence":"additional","affiliation":[{"name":"Vancouver Prostate Centre, Vancouver BC, Canada"}]},{"given":"Stephane","family":"Le Bihan","sequence":"additional","affiliation":[{"name":"Vancouver Prostate Centre, Vancouver BC, Canada"}]},{"given":"Robert","family":"Bell","sequence":"additional","affiliation":[{"name":"Vancouver Prostate Centre, Vancouver BC, Canada"}]},{"given":"Colin C","family":"Collins","sequence":"additional","affiliation":[{"name":"Vancouver Prostate Centre, Vancouver BC, Canada"},{"name":"Department of Urologic Sciences, University of British Columbia, Vancouver BC, Canada"}]},{"given":"Cedric","family":"Chauve","sequence":"additional","affiliation":[{"name":"Department of Mathematics, Simon Fraser University, Burnaby BC, 
Canada"}]},{"given":"Faraz","family":"Hach","sequence":"additional","affiliation":[{"name":"Vancouver Prostate Centre, Vancouver BC, Canada"},{"name":"Department of Urologic Sciences, University of British Columbia, Vancouver BC, Canada"}]}],"member":"286","published-online":{"date-parts":[[2018,10,23]]},"reference":[{"key":"2023012713072353900_bty888-B1","doi-asserted-by":"crossref","first-page":"10574","DOI":"10.1038\/s41598-017-10269-2","article-title":"Targeted error-suppressed quantification of circulating tumor DNA using semi-degenerate barcoded adapters and biotinylated baits","volume":"7","author":"Alcaide","year":"2017","journal-title":"Sci. Rep."},{"key":"2023012713072353900_bty888-B2","first-page":"21","article-title":"On the resemblance and containment of documents","volume-title":"Proceedings of the Compression and Complexity of Sequences 1997","author":"Broder","year":"1997"},{"key":"2023012713072353900_bty888-B3","doi-asserted-by":"crossref","first-page":"2732","DOI":"10.1093\/bioinformatics\/bts482","article-title":"Rainbow: an integrated tool for efficient clustering and assembling RAD-seq reads","volume":"28","author":"Chong","year":"2012","journal-title":"Bioinformatics"},{"key":"2023012713072353900_bty888-B4","doi-asserted-by":"crossref","first-page":"351","DOI":"10.1136\/mp.54.5.351","article-title":"PCR amplification introduces errors into mononucleotide and dinucleotide repeat sequences","volume":"54","author":"Clarke","year":"2001","journal-title":"Mol. Pathol."},{"key":"2023012713072353900_bty888-B5","doi-asserted-by":"crossref","DOI":"10.1038\/srep37563","article-title":"A novel process of viral vector barcoding and library preparation enables high-diversity library generation and recombination-free paired-end sequencing","volume":"6","author":"Davidsson","year":"2016","journal-title":"Sci. 
Rep."},{"key":"2023012713072353900_bty888-B6","doi-asserted-by":"crossref","first-page":"491","DOI":"10.1038\/ng.806","article-title":"A framework for variation discovery and genotyping using next-generation DNA sequencing data","volume":"43","author":"DePristo","year":"2011","journal-title":"Nat. Genet."},{"key":"2023012713072353900_bty888-B7","doi-asserted-by":"crossref","first-page":"3150","DOI":"10.1093\/bioinformatics\/bts565","article-title":"CD-HIT: accelerated for clustering the next-generation sequencing data","volume":"28","author":"Fu","year":"2012","journal-title":"Bioinformatics"},{"key":"2023012713072353900_bty888-B8","doi-asserted-by":"crossref","first-page":"177","DOI":"10.1038\/nrc1299","article-title":"A census of human cancer genes","volume":"4","author":"Futreal","year":"2004","journal-title":"Nat. Rev. Cancer"},{"key":"2023012713072353900_bty888-B9","first-page":"3907","article-title":"Haplotype-based variant detection from short-read sequencing","volume":"1207","author":"Garrison","year":"2012","journal-title":"arXiv"},{"key":"2023012713072353900_bty888-B10","first-page":"518","article-title":"Similarity search in high dimensions via hashing","volume-title":"VLDB \u201899 Proceedings of the 25th International Conference on Very Large Data Bases","author":"Gionis","year":"1999"},{"key":"2023012713072353900_bty888-B11","doi-asserted-by":"crossref","first-page":"593","DOI":"10.1093\/bioinformatics\/btr708","article-title":"ART: a next-generation sequencing read simulator","volume":"28","author":"Huang","year":"2012","journal-title":"Bioinformatics"},{"key":"2023012713072353900_bty888-B12","doi-asserted-by":"crossref","first-page":"193","DOI":"10.1007\/BF01908075","article-title":"Comparing partitions","volume":"2","author":"Hubert","year":"1985","journal-title":"J. 
Classif."},{"key":"2023012713072353900_bty888-B13","doi-asserted-by":"crossref","first-page":"26","DOI":"10.1093\/bioinformatics\/btw536","article-title":"SiNVICT: ultra-sensitive detection of single nucleotide variants and indels in circulating tumour DNA","volume":"33","author":"Kockan","year":"2017","journal-title":"Bioinformatics"},{"key":"2023012713072353900_bty888-B14","doi-asserted-by":"crossref","first-page":"e0146638","DOI":"10.1371\/journal.pone.0146638","article-title":"Benefits and challenges with applying unique molecular identifiers in next generation sequencing to detect low frequency mutations","volume":"11","author":"Kou","year":"2016","journal-title":"PLoS One"},{"key":"2023012713072353900_bty888-B15","doi-asserted-by":"crossref","first-page":"269","DOI":"10.1093\/dnares\/dsv010","article-title":"High-fidelity target sequencing of individual molecules identified using barcode sequences: de novo detection and absolute quantitation of mutations in plasma cell-free DNA from cancer patients","volume":"22","author":"Kukita","year":"2015","journal-title":"DNA Res."},{"key":"2023012713072353900_bty888-B16","doi-asserted-by":"crossref","first-page":"452","DOI":"10.1093\/bioinformatics\/18.3.452","article-title":"Multiple sequence alignment using partial order graphs","volume":"18","author":"Lee","year":"2002","journal-title":"Bioinformatics"},{"key":"2023012713072353900_bty888-B17","first-page":"3997","article-title":"Aligning sequence reads, clone sequences and assembly contigs with bwa-mem","volume":"1303","author":"Li","year":"2013","journal-title":"arXiv"},{"key":"2023012713072353900_bty888-B18","doi-asserted-by":"crossref","first-page":"2103","DOI":"10.1093\/bioinformatics\/btw152","article-title":"Minimap and miniasm: fast mapping and de novo assembly for noisy long 
sequences","volume":"32","author":"Li","year":"2016","journal-title":"Bioinformatics"},{"key":"2023012713072353900_bty888-B19","doi-asserted-by":"crossref","first-page":"42","DOI":"10.1186\/s40425-014-0042-0","article-title":"Circulating tumor DNA analysis as a real-time method for monitoring tumor burden in melanoma patients undergoing treatment with immune checkpoint blockade","volume":"2","author":"Lipson","year":"2014","journal-title":"J. Immunother. Cancer"},{"key":"2023012713072353900_bty888-B20","doi-asserted-by":"crossref","first-page":"19872","DOI":"10.1073\/pnas.1319590110","article-title":"High-throughput DNA sequencing errors are reduced by orders of magnitude using circle sequencing","volume":"110","author":"Lou","year":"2013","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"2023012713072353900_bty888-B21","doi-asserted-by":"crossref","first-page":"547","DOI":"10.1038\/nbt.3520","article-title":"Integrated digital error suppression for improved detection of circulating tumor DNA","volume":"34","author":"Newman","year":"2016","journal-title":"Nat. Biotechnol."},{"key":"2023012713072353900_bty888-B22","doi-asserted-by":"crossref","first-page":"132","DOI":"10.1186\/s13059-016-0997-x","article-title":"Mash: fast genome and metagenome distance estimation using MinHash","volume":"17","author":"Ondov","year":"2016","journal-title":"Genome Biol."},{"key":"2023012713072353900_bty888-B23","first-page":"2825","article-title":"Scikit-learn: machine learning in Python","volume":"12","author":"Pedregosa","year":"2011","journal-title":"J. Mach. Learn. Res."},{"key":"2023012713072353900_bty888-B24","doi-asserted-by":"crossref","first-page":"912","DOI":"10.1038\/ng.3036","article-title":"Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications","volume":"46","author":"Rimmer","year":"2014","journal-title":"Nat. 
Genet."},{"key":"2023012713072353900_bty888-B25","doi-asserted-by":"crossref","first-page":"125","DOI":"10.1186\/s12859-016-0976-y","article-title":"Illumina error profiles: resolving fine-scale variation in metagenomic sequencing data","volume":"17","author":"Schirmer","year":"2016","journal-title":"BMC Bioinformatics"},{"key":"2023012713072353900_bty888-B26","doi-asserted-by":"crossref","first-page":"426","DOI":"10.1038\/nrc3066","article-title":"Cell-free nucleic acids as biomarkers in cancer patients","volume":"11","author":"Schwarzenbach","year":"2011","journal-title":"Nat. Rev. Cancer"},{"key":"2023012713072353900_bty888-B27","doi-asserted-by":"crossref","first-page":"491","DOI":"10.1101\/gr.209601.116","article-title":"UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy","volume":"27","author":"Smith","year":"2017","journal-title":"Genome Res."},{"key":"2023012713072353900_bty888-B28","doi-asserted-by":"crossref","first-page":"180","DOI":"10.1186\/s13059-016-1039-4","article-title":"Streamlined analysis of duplex sequencing data with Du Novo","volume":"17","author":"Stoler","year":"2016","journal-title":"Genome Biol."},{"key":"2023012713072353900_bty888-B29","doi-asserted-by":"crossref","first-page":"737","DOI":"10.1101\/gr.214270.116","article-title":"Fast and accurate de novo genome assembly from long uncorrected reads","volume":"27","author":"Vaser","year":"2017","journal-title":"Genome Res."},{"key":"2023012713072353900_bty888-B31","doi-asserted-by":"crossref","first-page":"223","DOI":"10.1038\/nrc.2017.7","article-title":"Liquid biopsies come of age: towards implementation of circulating tumour DNA","volume":"17","author":"Wan","year":"2017","journal-title":"Nat. Rev. 
Cancer"},{"key":"2023012713072353900_bty888-B30","doi-asserted-by":"crossref","first-page":"1913","DOI":"10.1093\/bioinformatics\/btv053","article-title":"Starcode: sequence clustering based on all-pairs search","volume":"31","author":"Zorita","year":"2015","journal-title":"Bioinformatics"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/35\/11\/1829\/48934855\/bioinformatics_35_11_1829.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/35\/11\/1829\/48934855\/bioinformatics_35_11_1829.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,27]],"date-time":"2023-01-27T14:12:27Z","timestamp":1674828747000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/35\/11\/1829\/5142725"}},"subtitle":[],"editor":[{"given":"Bonnie","family":"Berger","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2018,10,23]]},"references-count":31,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2019,6,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/bty888","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2019,6,1]]},"published":{"date-parts":[[2018,10,23]]}}}