{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,28]],"date-time":"2026-01-28T08:08:26Z","timestamp":1769587706499,"version":"3.49.0"},"reference-count":95,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2017,9,30]],"date-time":"2017-09-30T00:00:00Z","timestamp":1506729600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Melbourne International Research Scholarship from the University of Melbourne"},{"name":"Australian Research Council through a Discovery Project","award":["DP150101550"],"award-info":[{"award-number":["DP150101550"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["J. Data and Information Quality"],"published-print":{"date-parts":[[2017,9,30]]},"abstract":"<jats:p>The massive volumes of data in biological sequence databases provide a remarkable resource for large-scale biological studies. However, the underlying data quality of these resources is a critical concern. A particular challenge is duplication, in which multiple records have similar sequences, creating a high level of redundancy that impacts database storage, curation, and search. Biological database deduplication has two direct applications: for database curation, where detected duplicates are removed to improve curation efficiency, and for database search, where detected duplicate sequences may be flagged but remain available to support analysis.<\/jats:p><jats:p>Clustering methods have been widely applied to biological sequences for database deduplication. Since an exhaustive all-by-all pairwise comparison of sequences cannot scale for a high volume of data, heuristic approaches have been recruited, such as the use of simple similarity thresholds. 
In this article, we present a comparison between CD-HIT and UCLUST, the two best-known clustering tools for sequence database deduplication. Our contributions include a detailed assessment of the redundancy remaining after deduplication, application of standard clustering evaluation metrics to quantify the cohesion and separation of the clusters generated by each method, and a biological case study that assesses intracluster function annotation consistency to demonstrate the impact of these factors on a practical application of the sequence clustering methods. Our results show that the trade-off between efficiency and accuracy becomes acute when low threshold values are used and when cluster sizes are large. This evaluation leads to practical recommendations for users for more effective uses of the sequence clustering tools for deduplication.<\/jats:p>","DOI":"10.1145\/3131611","type":"journal-article","created":{"date-parts":[[2018,1,29]],"date-time":"2018-01-29T13:48:05Z","timestamp":1517233685000},"page":"1-27","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":9,"title":["Comparative Analysis of Sequence Clustering Methods for Deduplication of Biological Databases"],"prefix":"10.1145","volume":"9","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6036-1516","authenticated-orcid":false,"given":"Qingyu","family":"Chen","sequence":"first","affiliation":[{"name":"University of Melbourne, Australia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yu","family":"Wan","sequence":"additional","affiliation":[{"name":"University of Melbourne, Victoria, Australia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiuzhen","family":"Zhang","sequence":"additional","affiliation":[{"name":"RMIT University, Melbourne VIC"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yang","family":"Lei","sequence":"additional","affiliation":[{"name":"University of Melbourne, 
Australia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Justin","family":"Zobel","sequence":"additional","affiliation":[{"name":"University of Melbourne, Australia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Karin","family":"Verspoor","sequence":"additional","affiliation":[{"name":"University of Melbourne, Australia"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2018,1,27]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/1498759.1498766"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0022-2836(05)80360-2"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2012.07.021"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1038\/75556"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/1541880.1541883"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1093\/nar\/gku1216"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/TFUZZ.2016.2540063"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1093\/nar\/gkw1132"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4939-3167-5_2"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1093\/database\/baw139"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1016\/S1570-9639(03)00112-2"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/BIBM.2016.7822604"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2811163.2811175"},{"key":"e_1_2_1_14_1","volume-title":"Benchmarks for measurement of duplicate detection methods in nucleotide databases","author":"Chen Qingyu","year":"2017","unstructured":"Qingyu Chen, Justin Zobel, and Karin Verspoor. 2017. Benchmarks for measurement of duplicate detection methods in nucleotide databases. Database: The Journal of Biological Databases and Curation 2017, baw164."},
{"key":"e_1_2_1_15_1","volume-title":"redundancies, and inconsistencies in the primary nucleotide databases: A descriptive study","author":"Chen Qingyu","year":"2017","unstructured":"Qingyu Chen, Justin Zobel, and Karin Verspoor. 2017. Duplicates, redundancies, and inconsistencies in the primary nucleotide databases: A descriptive study. Database: The Journal of Biological Databases and Curation 2017, baw163."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0159644"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2011.127"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1093\/nar\/gkn238"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1093\/nar\/gkw1108"},{"key":"e_1_2_1_20_1","volume-title":"Proceedings of the International Workshop on Semantic Web Applications and Tools for Life Sciences (SWAT4LS\u201915)","author":"Courtot M\u00e9lanie","year":"2015","unstructured":"M\u00e9lanie Courtot, Aleksandra Shypitsyna, Elena Speretta, Alexander Holmes, Tony Sawford, Tony Wardell, Maria Jesus Martin, and Claire O\u2019Donovan. 2015. 
UniProt-GOA: A central resource for data integration and GO annotation. In Proceedings of the International Workshop on Semantic Web Applications and Tools for Life Sciences (SWAT4LS\u201915). 227--228."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1186\/2041-1480-2-5"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1037\/h0029393"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/2063576.2063904"},{"key":"e_1_2_1_24_1","volume-title":"The Gene Ontology Handbook. Methods in Molecular Biology","author":"Dessimoz Christophe","unstructured":"Christophe Dessimoz and Nives \u0160kunca. 2016. The Gene Ontology Handbook. Methods in Molecular Biology. Springer."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1162\/COLI_a_00148"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.2174\/092986609787848045"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btq461"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1093\/cercor\/bhu250"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2007.9"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/2854006.2854008"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1093\/nar\/gkv1344"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/bts565"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1093\/nar\/gkw1188"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.14778\/2367502.2367564"},{"key":"e_1_2_1_35_1","volume-title":"Data Mining: Concepts and Techniques","author":"Han Jiawei","year":"2011","unstructured":"Jiawei Han, Jian Pei, and Micheline Kamber. 2011. 
Data Mining: Concepts and Techniques. Elsevier."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/bti517"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2016.2610324"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1073\/pnas.1501049112"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.4137\/EBO.S8681"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btq003"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2008.4630070"},{"key":"e_1_2_1_42_1","volume-title":"Proceedings of the 24th Australasian Conference on Information Systems (ACIS\u201913)","author":"Jayawardene Vimukthi","year":"2013","unstructured":"Vimukthi Jayawardene, Shazia Sadiq, and Marta Indulska. 2013. The curse of dimensionality in data quality. In Proceedings of the 24th Australasian Conference on Information Systems (ACIS\u201913). 1--11."},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1186\/1471-2164-10-263"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1089\/cmb.2008.0236"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1146\/annurev-statistics-060116-054114"},{"key":"e_1_2_1_46_1","volume-title":"Fr\u00e9d\u00e9ric Mah\u00e9, Yan He, Hong-Wei Zhou, Torbj\u00f8rn Rognes, J. Gregory Caporaso, and Rob Knight.","author":"Kopylova Evguenia","year":"2016","unstructured":"Evguenia Kopylova, Jose A. 
Navas-Molina, C\u00e9line Mercier, Zhenjiang Zech Xu, Fr\u00e9d\u00e9ric Mah\u00e9, Yan He, Hong-Wei Zhou, Torbj\u00f8rn Rognes, J. Gregory Caporaso, and Rob Knight. 2016. Open-source sequence clustering methods improve the state of the art. mSystems 1, 1, e00003--15."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1093\/nar\/24.2.316"},{"key":"e_1_2_1_48_1","doi-asserted-by":"crossref","first-page":"121","DOI":"10.3233\/ISB-00350","article-title":"COPid: Composition based protein identification","volume":"8","author":"Kumar Manish","year":"2008","unstructured":"Manish Kumar, Varun Thakur, and Gajendra P. S. Raghava. 2008. COPid: Composition based protein identification. In Silico Biology 8, 2, 121--128.","journal-title":"In Silico Biology"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1093\/nar\/gkn808"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btl158"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/17.3.282"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/18.1.77"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/2501654.2501658"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM.2010.35"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/1242572.1242592"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.5555\/2008664.2008669"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btv590"},{"key":"e_1_2_1_58_1","volume-title":"Mulder","author":"Mazandu Gaston K.","year":"2016","unstructured":"Gaston K. Mazandu, Emile R. Chimusa, and Nicola J. Mulder. 2016. Gene ontology semantic similarity tools: Survey on features and challenges for biological knowledge discovery. Briefings in Bioinformatics 2016, bbw067."},
{"key":"e_1_2_1_59_1","volume-title":"Mulder","author":"Mazandu Gaston K.","year":"2013","unstructured":"Gaston K. Mazandu and Nicola J. Mulder. 2013. Information content-based gene ontology semantic similarity approaches: Toward a unified framework theory. BioMed Research International 2013, Article No. 292063."},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0113859"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/bti797"},{"key":"e_1_2_1_62_1","volume-title":"Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Research","author":"Mirdita Milot","year":"2016","unstructured":"Milot Mirdita, Lars von den Driesch, Clovis Galiez, Maria J. Martin, Johannes S\u00f6ding, and Martin Steinegger. 2016. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Research 2016, gkw1081."},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1186\/1471-2105-9-327"},{"key":"e_1_2_1_64_1","volume-title":"Proceedings of the International Conference on Information Quality. 
269--284","author":"M\u00fcller Heiko","year":"2003","unstructured":"Heiko M\u00fcller, Felix Naumann, and Johann-Christoph Freytag. 2003. Data quality in genome databases. In Proceedings of the International Conference on Information Quality. 269--284."},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1016\/0022-2836(70)90057-4"},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1186\/1471-2105-11-187"},{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1186\/1471-2105-9-S5-S4"},{"key":"e_1_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pcbi.1000443"},{"key":"e_1_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.polymer.2007.07.039"},{"key":"e_1_2_1_70_1","volume-title":"Expert curation in UniProtKB: A case study on dealing with conflicting and erroneous data. Database","author":"Poux Sylvain","year":"2014","unstructured":"Sylvain Poux, Michele Magrane, Cecilia N. Arighi, Alan Bridge, Claire O\u2019Donovan, Kati Laiho; UniProt Consortium. 2014. Expert curation in UniProtKB: A case study on dealing with conflicting and erroneous data. Database 2014, bau016."},
{"key":"e_1_2_1_71_1","volume-title":"Proceedings of the 7th Biennial Conference on Innovative Data Systems Research (CIDR\u201915)","author":"Rekatsinas Theodoros","year":"2015","unstructured":"Theodoros Rekatsinas, Xin Luna Dong, Lise Getoor, and Divesh Srivastava. 2015. Finding quality in quantity: The challenge of discovering valuable sources for integration. In Proceedings of the 7th Biennial Conference on Innovative Data Systems Research (CIDR\u201915)."},{"key":"e_1_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.1039\/C5NR08944A"},{"key":"e_1_2_1_73_1","volume-title":"Bastian","author":"Rosikiewicz Marta","year":"2013","unstructured":"Marta Rosikiewicz, Aur\u00e9lie Comte, Anne Niknejad, Marc Robinson-Rechavi, and Frederic B. Bastian. 2013. Uncovering hidden duplicated content in public transcriptomics data. Database 2013, bat010."},{"key":"e_1_2_1_74_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2014.6816764"},{"key":"e_1_2_1_75_1","doi-asserted-by":"publisher","DOI":"10.2741\/1627"},{"key":"e_1_2_1_76_1","doi-asserted-by":"publisher","DOI":"10.1561\/1500000040"},{"key":"e_1_2_1_77_1","doi-asserted-by":"publisher","DOI":"10.1186\/1756-0500-7-249"},{"key":"e_1_2_1_78_1","doi-asserted-by":"publisher","DOI":"10.1128\/AEM.02810-10"},{"key":"e_1_2_1_79_1","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0017288"},{"key":"e_1_2_1_80_1","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pcbi.1000605"},{"key":"e_1_2_1_81_1","doi-asserted-by":"publisher","DOI":"10.1109\/34.868688"},{"key":"e_1_2_1_82_1","doi-asserted-by":"publisher","DOI":"10.1093\/nar\/gkl893"},{"key":"e_1_2_1_83_1","doi-asserted-by":"publisher","DOI":"10.6026\/97320630005234"},{"key":"e_1_2_1_84_1","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btm098"},{"key":"e_1_2_1_85_1","doi-asserted-by":"publisher","DOI":"10.1186\/1471-2105-7-213"},{"key":"e_1_2
_1_86_1","doi-asserted-by":"publisher","DOI":"10.1186\/1471-2105-13-40"},{"key":"e_1_2_1_87_1","doi-asserted-by":"publisher","DOI":"10.1186\/1471-2105-9-310"},{"key":"e_1_2_1_88_1","volume-title":"UniProt: A hub for protein information. Nucleic Acids Research","author":"UniProt Consortium","year":"2014","unstructured":"UniProt Consortium. 2014. UniProt: A hub for protein information. Nucleic Acids Research 2014, gku989."},{"key":"e_1_2_1_89_1","doi-asserted-by":"publisher","DOI":"10.1093\/nar\/gku469"},{"key":"e_1_2_1_90_1","doi-asserted-by":"publisher","DOI":"10.1016\/0306-4573(88)90027-1"},{"key":"e_1_2_1_91_1","doi-asserted-by":"publisher","DOI":"10.1145\/2000824.2000825"},{"key":"e_1_2_1_92_1","doi-asserted-by":"publisher","DOI":"10.1002\/jcc.21163"},{"key":"e_1_2_1_93_1","volume-title":"Wagner Meira Jr., and Wagner Meira","author":"Zaki Mohammed J.","year":"2014","unstructured":"Mohammed J. Zaki, Wagner Meira Jr., and Wagner Meira. 2014. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press."},{"key":"e_1_2_1_94_1","doi-asserted-by":"publisher","DOI":"10.4172\/jpb.1000165"},{"key":"e_1_2_1_95_1","volume-title":"Starcode: Sequence clustering based on all-pairs search. Bioinformatics","author":"Zorita Eduard Valera","year":"2015","unstructured":"Eduard Valera Zorita, Pol Cusc\u00f3, and Guillaume Filion. 2015. Starcode: Sequence clustering based on all-pairs search. 
Bioinformatics 2015, btv053."}],"container-title":["Journal of Data and Information Quality"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3131611","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3131611","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,30]],"date-time":"2025-06-30T02:37:20Z","timestamp":1751251040000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3131611"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2017,9,30]]},"references-count":95,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2017,9,30]]}},"alternative-id":["10.1145\/3131611"],"URL":"https:\/\/doi.org\/10.1145\/3131611","relation":{},"ISSN":["1936-1955","1936-1963"],"issn-type":[{"value":"1936-1955","type":"print"},{"value":"1936-1963","type":"electronic"}],"subject":[],"published":{"date-parts":[[2017,9,30]]},"assertion":[{"value":"2017-03-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2017-07-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-01-27","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}