{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,7,30]],"date-time":"2025-07-30T11:43:20Z","timestamp":1753875800930,"version":"3.41.2"},"reference-count":46,"publisher":"Oxford University Press (OUP)","issue":"5","license":[{"start":{"date-parts":[[2024,5,17]],"date-time":"2024-05-17T00:00:00Z","timestamp":1715904000000},"content-version":"vor","delay-in-days":16,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62272253","62272252"],"award-info":[{"award-number":["62272253","62272252"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2024,5,2]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>The quality scores data (QSD) account for 70% in compressed FastQ files obtained from the short and long reads sequencing technologies. Designing effective compressors for QSD that counterbalance compression ratio, time cost, and memory consumption is essential in scenarios such as large-scale genomics data sharing and long-term data backup. This study presents a novel parallel lossless QSD-dedicated compression algorithm named PQSDC, which fulfills the above requirements well. PQSDC is based on two core components: a parallel sequences-partition model designed to reduce peak memory consumption and time cost during compression and decompression processes, as well as a parallel four-level run-length prediction mapping model to enhance compression ratio. Besides, the PQSDC algorithm is also designed to be highly concurrent using multicore CPU clusters.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>We evaluate PQSDC and four state-of-the-art compression algorithms on 27 real-world datasets, including 61.857 billion QSD characters and 632.908 million QSD sequences. (1) For short reads, compared to baselines, the maximum improvement of PQSDC reaches 7.06% in average compression ratio, and 8.01% in weighted average compression ratio. During compression and decompression, the maximum total time savings of PQSDC are 79.96% and 84.56%, respectively; the maximum average memory savings are 68.34% and 77.63%, respectively. (2) For long reads, the maximum improvement of PQSDC reaches 12.51% and 13.42% in average and weighted average compression ratio, respectively. The maximum total time savings during compression and decompression are 53.51% and 72.53%, respectively; the maximum average memory savings are 19.44% and 17.42%, respectively. (3) Furthermore, PQSDC ranks second in compression robustness among the tested algorithms, indicating that it is less affected by the probability distribution of the QSD collections. Overall, our work provides a promising solution for QSD parallel compression, which balances storage cost, time consumption, and memory occupation primely.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>The proposed PQSDC compressor can be downloaded from https:\/\/github.com\/fahaihi\/PQSDC.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btae323","type":"journal-article","created":{"date-parts":[[2024,5,17]],"date-time":"2024-05-17T19:38:58Z","timestamp":1715974738000},"source":"Crossref","is-referenced-by-count":3,"title":["PQSDC: a parallel lossless compressor for quality scores data via sequences partition and run-length prediction mapping"],"prefix":"10.1093","volume":"40","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0290-142X","authenticated-orcid":false,"given":"Hui","family":"Sun","sequence":"first","affiliation":[{"name":"Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University , Tianjin 300350, China"}]},{"given":"Yingfeng","family":"Zheng","sequence":"additional","affiliation":[{"name":"Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University , Tianjin 300350, China"}]},{"given":"Haonan","family":"Xie","sequence":"additional","affiliation":[{"name":"Institute of Artificial Intelligence, School of Electrical Engineering, Guangxi University , Nanning 530004, China"}]},{"given":"Huidong","family":"Ma","sequence":"additional","affiliation":[{"name":"Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University , Tianjin 300350, China"}]},{"given":"Cheng","family":"Zhong","sequence":"additional","affiliation":[{"name":"Key Laboratory of Parallel, Distributed and Intelligent of Guangxi Universities and Colleges, School of Computer, Electronics and Information, Guangxi University , Nanning 530004, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5280-6653","authenticated-orcid":false,"given":"Meng","family":"Yan","sequence":"additional","affiliation":[{"name":"Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University , Tianjin 300350, China"}]},{"given":"Xiaoguang","family":"Liu","sequence":"additional","affiliation":[{"name":"Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University , Tianjin 300350, China"}]},{"given":"Gang","family":"Wang","sequence":"additional","affiliation":[{"name":"Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University , Tianjin 300350, China"}]}],"member":"286","published-online":{"date-parts":[[2024,5,17]]},"reference":[{"key":"2024053023111127100_btae323-B1","doi-asserted-by":"crossref","first-page":"2818","DOI":"10.1093\/bioinformatics\/btu390","article-title":"The scramble conversion tool","volume":"30","author":"Bonfield","year":"2014","journal-title":"Bioinformatics"},{"key":"2024053023111127100_btae323-B2","doi-asserted-by":"crossref","first-page":"e59190","DOI":"10.1371\/journal.pone.0059190","article-title":"Compression of FASTQ and sam format sequencing data","volume":"8","author":"Bonfield","year":"2013","journal-title":"PLoS One"},{"key":"2024053023111127100_btae323-B3","doi-asserted-by":"crossref","first-page":"337","DOI":"10.1093\/bioinformatics\/bty608","article-title":"Crumble: reference free lossy compression of sequence quality values","volume":"35","author":"Bonfield","year":"2019","journal-title":"Bioinformatics"},{"key":"2024053023111127100_btae323-B4","doi-asserted-by":"crossref","first-page":"2130","DOI":"10.1093\/bioinformatics\/btu183","article-title":"Lossy compression of quality scores in genomic data","volume":"30","author":"C\u00e1novas","year":"2014","journal-title":"Bioinformatics"},{"key":"2024053023111127100_btae323-B5","doi-asserted-by":"crossref","first-page":"2674","DOI":"10.1093\/bioinformatics\/bty1015","article-title":"Spring: a next-generation compressor for FASTQ data","volume":"35","author":"Chandak","year":"2019","journal-title":"Bioinformatics"},{"key":"2024053023111127100_btae323-B6","doi-asserted-by":"crossref","first-page":"294","DOI":"10.1186\/s12859-022-04837-1","article-title":"CMIC: an efficient quality score compressor with random access functionality","volume":"23","author":"Chen","year":"2022","journal-title":"BMC Bioinformatics"},{"key":"2024053023111127100_btae323-B7","doi-asserted-by":"crossref","first-page":"606","DOI":"10.1186\/s12859-021-04516-7","article-title":"FCLQC: fast and concurrent lossless quality scores compressor","volume":"22","author":"Cho","year":"2021","journal-title":"BMC Bioinformatics"},{"key":"2024053023111127100_btae323-B8","doi-asserted-by":"crossref","first-page":"4506","DOI":"10.1093\/bioinformatics\/btaa551","article-title":"ENANO: encoder for nanopore FASTQ files","volume":"36","author":"Dufort Y \u00c1lvarez","year":"2020","journal-title":"Bioinformatics"},{"key":"2024053023111127100_btae323-B9","doi-asserted-by":"crossref","first-page":"4862","DOI":"10.1093\/bioinformatics\/btab437","article-title":"RENANO: a reference-based compressor for nanopore FASTQ files","volume":"37","author":"Dufort Y \u00c1lvarez","year":"2021","journal-title":"Bioinformatics"},{"year":"2017","author":"Fu","first-page":"353","key":"2024053023111127100_btae323-B10"},{"key":"2024053023111127100_btae323-B11","doi-asserted-by":"crossref","first-page":"109","DOI":"10.1186\/s12859-020-3428-7","article-title":"LCQS: an efficient lossless compression tool of quality scores with random access functionality","volume":"21","author":"Fu","year":"2020","journal-title":"BMC Bioinformatics"},{"key":"2024053023111127100_btae323-B12","doi-asserted-by":"crossref","first-page":"3124","DOI":"10.1093\/bioinformatics\/btw385","article-title":"GeneCodeq: quality score compression and improved genotyping using a Bayesian framework","volume":"32","author":"Greenfield","year":"2016","journal-title":"Bioinformatics"},{"key":"2024053023111127100_btae323-B13","doi-asserted-by":"crossref","first-page":"baaa055","DOI":"10.1093\/database\/baaa055","article-title":"CNSA: a data repository for archiving omics data","volume":"2020","author":"Guo","year":"2020","journal-title":"Database"},{"year":"2016","author":"Hernaez","first-page":"261","key":"2024053023111127100_btae323-B14"},{"key":"2024053023111127100_btae323-B15","doi-asserted-by":"crossref","first-page":"19","DOI":"10.1146\/annurev-biodatasci-072018-021229","article-title":"Genomic data compression","volume":"2","author":"Hernaez","year":"2019","journal-title":"Annu Rev Biomed Data Sci"},{"key":"2024053023111127100_btae323-B16","doi-asserted-by":"crossref","first-page":"179","DOI":"10.1186\/s12859-017-1588-x","article-title":"LW-FQZip 2: a parallelized reference-based compression of FASTQ files","volume":"18","author":"Huang","year":"2017","journal-title":"BMC Bioinformatics"},{"year":"2010","author":"Ipavlov","key":"2024053023111127100_btae323-B17"},{"key":"2024053023111127100_btae323-B18","doi-asserted-by":"crossref","first-page":"441","DOI":"10.1038\/s41592-022-01432-3","article-title":"CoLoRd: compressing long reads","volume":"19","author":"Kokot","year":"2022","journal-title":"Nat Methods"},{"key":"2024053023111127100_btae323-B19","doi-asserted-by":"crossref","first-page":"e0232942","DOI":"10.1371\/journal.pone.0232942","article-title":"Vertical lossless genomic data compression tools for assembled genomes: a systematic literature review","volume":"15","author":"Kredens","year":"2020","journal-title":"PLoS One"},{"key":"2024053023111127100_btae323-B20","doi-asserted-by":"crossref","first-page":"2225","DOI":"10.1093\/bioinformatics\/btab102","article-title":"GenoZip: a universal extensible genomic data compressor","volume":"37","author":"Lan","year":"2021","journal-title":"Bioinformatics"},{"key":"2024053023111127100_btae323-B21","doi-asserted-by":"crossref","first-page":"351","DOI":"10.1093\/bioinformatics\/btab696","article-title":"FastqCLS: a FASTQ compressor for long-read sequencing via read reordering using a novel scoring model","volume":"38","author":"Lee","year":"2022","journal-title":"Bioinformatics"},{"key":"2024053023111127100_btae323-B22","doi-asserted-by":"crossref","first-page":"e1009229","DOI":"10.1371\/journal.pcbi.1009229","article-title":"Hamming-shifting graph of genomic short reads: efficient construction and its application for compression","volume":"17","author":"Liu","year":"2021","journal-title":"PLoS Comput Biol"},{"year":"2016","author":"Mahoney","key":"2024053023111127100_btae323-B23"},{"key":"2024053023111127100_btae323-B24","doi-asserted-by":"crossref","first-page":"3122","DOI":"10.1093\/bioinformatics\/btv330","article-title":"QVZ: lossy compression of quality values","volume":"31","author":"Malysa","year":"2015","journal-title":"Bioinformatics"},{"key":"2024053023111127100_btae323-B25","doi-asserted-by":"crossref","first-page":"140","DOI":"10.38094\/jastt1457","article-title":"A review on linear regression comprehensive in machine learning","volume":"1","author":"Maulud","year":"2020","journal-title":"JASTT"},{"key":"2024053023111127100_btae323-B26","doi-asserted-by":"crossref","first-page":"3276","DOI":"10.1093\/bioinformatics\/btv384","article-title":"LFQC: a lossless compression algorithm for FASTQ files","volume":"31","author":"Nicolae","year":"2015","journal-title":"Bioinformatics"},{"key":"2024053023111127100_btae323-B27","doi-asserted-by":"crossref","first-page":"2050031","DOI":"10.1142\/S0219720020500316","article-title":"CROMqs: an infinitesimal successive refinement lossy compressor for the quality scores","volume":"18","author":"No","year":"2020","journal-title":"J Bioinform Comput Biol"},{"key":"2024053023111127100_btae323-B28","doi-asserted-by":"crossref","first-page":"187","DOI":"10.1186\/1471-2105-14-187","article-title":"QualComp: a new lossy compressor for quality scores based on rate distortion theory","volume":"14","author":"Ochoa","year":"2013","journal-title":"BMC Bioinformatics"},{"volume-title":"An Introduction to Parallel Programming","year":"2011","author":"Pacheco","key":"2024053023111127100_btae323-B29"},{"key":"2024053023111127100_btae323-B30","doi-asserted-by":"crossref","first-page":"425","DOI":"10.1093\/bioinformatics\/btx607","article-title":"AQUa: an adaptive framework for compression of sequencing quality scores with random access functionality","volume":"34","author":"Paridaens","year":"2018","journal-title":"Bioinformatics"},{"key":"2024053023111127100_btae323-B31","doi-asserted-by":"crossref","first-page":"2213","DOI":"10.1093\/bioinformatics\/btu208","article-title":"DSRC2: industry oriented compression of FASTQ files","volume":"30","author":"Roguski","year":"2014","journal-title":"Bioinformatics"},{"key":"2024053023111127100_btae323-B32","doi-asserted-by":"crossref","first-page":"2748","DOI":"10.1093\/bioinformatics\/bty205","article-title":"FaStore: a space-saving solution for raw sequencing data","volume":"34","author":"Roguski","year":"2018","journal-title":"Bioinformatics"},{"volume-title":"Introduction to Data Compression","year":"2017","author":"Sayood","key":"2024053023111127100_btae323-B33"},{"key":"2024053023111127100_btae323-B34","doi-asserted-by":"crossref","first-page":"52","DOI":"10.1109\/6.591665","article-title":"Moore\u2019s law: past, present and future","volume":"34","author":"Schaller","year":"1997","journal-title":"IEEE Spectr"},{"year":"2019","author":"Seward","key":"2024053023111127100_btae323-B35"},{"year":"2023","author":"Sun","first-page":"60","key":"2024053023111127100_btae323-B36"},{"key":"2024053023111127100_btae323-B37","doi-asserted-by":"crossref","first-page":"454","DOI":"10.1186\/s12859-023-05566-9","article-title":"PMFFRC: a large-scale genomic short reads compression optimizer via memory modeling and redundant clustering","volume":"24","author":"Sun","year":"2023","journal-title":"BMC Bioinformatics"},{"key":"2024053023111127100_btae323-B38","doi-asserted-by":"crossref","first-page":"1141","DOI":"10.1089\/cmb.2018.0065","article-title":"A two-level scheme for quality score compression","volume":"25","author":"Voges","year":"2018","journal-title":"J Comput Biol"},{"volume-title":"DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP)","year":"2023","author":"Wetterstrand","key":"2024053023111127100_btae323-B39"},{"volume-title":"The CUDA Handbook: A Comprehensive Guide to GPU Programming","year":"2013","author":"Wilt","key":"2024053023111127100_btae323-B40"},{"key":"2024053023111127100_btae323-B41","doi-asserted-by":"crossref","first-page":"549","DOI":"10.1186\/s12859-017-1973-5","article-title":"GTZ: a fast compression and cloud transmission tool optimized for FASTQ files","volume":"18","author":"Xing","year":"2017","journal-title":"BMC Bioinformatics"},{"key":"2024053023111127100_btae323-B42","doi-asserted-by":"crossref","first-page":"4551","DOI":"10.1093\/bioinformatics\/btaa543","article-title":"ScaleQC: a scalable lossy to lossless solution for NGS data compression","volume":"36","author":"Yu","year":"2020","journal-title":"Bioinformatics"},{"year":"2014","author":"Yu","first-page":"385","key":"2024053023111127100_btae323-B43"},{"key":"2024053023111127100_btae323-B44","doi-asserted-by":"crossref","first-page":"240","DOI":"10.1038\/nbt.3170","article-title":"Quality score compression improves genotyping accuracy","volume":"33","author":"Yu","year":"2015","journal-title":"Nat Biotechnol"},{"key":"2024053023111127100_btae323-B45","doi-asserted-by":"crossref","first-page":"188","DOI":"10.1186\/s12859-015-0628-7","article-title":"Light-weight reference-based compression of FASTQ data","volume":"16","author":"Zhang","year":"2015","journal-title":"BMC Bioinformatics"},{"key":"2024053023111127100_btae323-B46","first-page":"160","article-title":"Parallel algorithm for sensitive sequence recognition from long-read genome data with high error rate","volume":"44","author":"Zhong","year":"2023","journal-title":"J Commun"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btae323\/57728452\/btae323.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/40\/5\/btae323\/58011786\/btae323.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/40\/5\/btae323\/58011786\/btae323.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,5,31]],"date-time":"2024-05-31T01:00:45Z","timestamp":1717117245000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btae323\/7676123"}},"subtitle":[],"editor":[{"given":"Can","family":"Alkan","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2024,5,1]]},"references-count":46,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2024,5,2]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btae323","relation":{},"ISSN":["1367-4811"],"issn-type":[{"type":"electronic","value":"1367-4811"}],"subject":[],"published-other":{"date-parts":[[2024,5,1]]},"published":{"date-parts":[[2024,5,1]]},"article-number":"btae323"}}