{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,12]],"date-time":"2026-03-12T01:25:01Z","timestamp":1773278701224,"version":"3.50.1"},"reference-count":61,"publisher":"Springer Science and Business Media LLC","issue":"1","content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"published-print":{"date-parts":[[2007,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:sec>\n            <jats:title>Background<\/jats:title>\n            <jats:p>Similarity of sequences is a key mathematical notion for Classification and Phylogenetic studies in Biology. It is currently primarily handled using alignments. However, the alignment methods seem inadequate for post-genomic studies since they do not scale well with data set size and they seem to be confined only to genomic and proteomic sequences. Therefore, alignment-free similarity measures are actively pursued. Among those, USM (Universal Similarity Metric) has gained prominence. It is based on the deep theory of Kolmogorov Complexity and <jats:italic>universality<\/jats:italic> is its most novel striking feature. Since it can only be approximated via data compression, USM is a methodology rather than a formula quantifying the similarity of two strings. Three approximations of USM are available, namely UCD (Universal Compression Dissimilarity), NCD (Normalized Compression Dissimilarity) and CD (Compression Dissimilarity). Their applicability and robustness is tested on various data sets yielding a first massive quantitative estimate that the USM methodology and its approximations are of value. 
Despite the rich theory developed around USM, its experimental assessment has limitations: only a few data compressors have been tested in conjunction with USM and mostly at a qualitative level, no comparison among UCD, NCD and CD is available and no comparison of USM with existing methods, both based on alignments and not, seems to be available.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Results<\/jats:title>\n            <jats:p>We experimentally test the USM methodology by using 25 compressors, all three of its known approximations and six data sets of relevance to Molecular Biology. This offers the first systematic and quantitative experimental assessment of this methodology, which naturally complements the many theoretical and the preliminary experimental results available. Moreover, we compare the USM methodology both with methods based on alignments and not. We may group our experiments into two sets. The first one, performed via ROC (Receiver Operating Characteristic) analysis, aims at assessing the <jats:italic>intrinsic<\/jats:italic> ability of the methodology to discriminate and classify biological sequences and structures. A second set of experiments aims at assessing how well two commonly available classification algorithms, UPGMA (Unweighted Pair Group Method with Arithmetic Mean) and NJ (Neighbor Joining), can use the methodology to perform their task, their performance being evaluated against gold standards and with the use of well-known statistical indexes, i.e., the F-measure and the partition distance. Based on the experiments, several conclusions can be drawn and, from them, novel valuable guidelines for the use of USM on biological data. 
The main ones are reported next.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Conclusion<\/jats:title>\n            <jats:p>UCD and NCD are indistinguishable, i.e., they yield nearly the same values of the statistical indexes we have used, across experiments and data sets, while CD is almost always worse than both. UPGMA seems to yield better classification results with respect to NJ, i.e., better values of the statistical indexes (10% difference or above), on a substantial fraction of experiments, compressors and USM approximation choices. The compression program PPMd, based on PPM (Prediction by Partial Matching), for generic data and Gencompress for DNA, are the best performers among the compression algorithms we have used, although the difference in performance, as measured by statistical indexes, between them and the other algorithms depends critically on the data set and may not be as large as expected. PPMd used with UCD or NCD and UPGMA, on sequence data, is very close in performance to, although worse than, the alignment methods (less than 2% difference on the F-measure). Yet, it scales well with data set size and it can work on data other than sequences. In summary, our quantitative analysis naturally complements the rich theory behind USM and supports the conclusion that the methodology is worth using because of its robustness, flexibility, scalability, and competitiveness with existing techniques. In particular, the methodology applies to all biological data in textual format. 
The software and data sets are available under the GNU GPL at the supplementary material web page.<\/jats:p>\n          <\/jats:sec>","DOI":"10.1186\/1471-2105-8-252","type":"journal-article","created":{"date-parts":[[2007,7,13]],"date-time":"2007-07-13T18:13:46Z","timestamp":1184350426000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":90,"title":["Compression-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment"],"prefix":"10.1186","volume":"8","author":[{"given":"Paolo","family":"Ferragina","sequence":"first","affiliation":[]},{"given":"Raffaele","family":"Giancarlo","sequence":"additional","affiliation":[]},{"given":"Valentina","family":"Greco","sequence":"additional","affiliation":[]},{"given":"Giovanni","family":"Manzini","sequence":"additional","affiliation":[]},{"given":"Gabriel","family":"Valiente","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2007,7,13]]},"reference":[{"key":"1624_CR1","unstructured":"Kolmogorov Library Supplementary Material Web Page. [http:\/\/www.math.unipa.it\/~raffaele\/kolmogorov\/]"},{"key":"1624_CR2","volume-title":"Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison","year":"1983","unstructured":"Kruskal J, Sankoff D, (Eds): Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. 1983, Addison-Wesley"},{"key":"1624_CR3","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4899-6846-3","volume-title":"Introduction to Computational Biology. Maps, Sequences and Genomes","author":"M Waterman","year":"1995","unstructured":"Waterman M: Introduction to Computational Biology. Maps, Sequences and Genomes. 
1995, Chapman Hall"},{"key":"1624_CR4","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511574931","volume-title":"Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology","author":"D Gusfield","year":"1997","unstructured":"Gusfield D: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. 1997, Cambridge University Press"},{"issue":"4","key":"1624_CR5","doi-asserted-by":"publisher","first-page":"513","DOI":"10.1093\/bioinformatics\/btg005","volume":"19","author":"S Vinga","year":"2003","unstructured":"Vinga S, Almeida J: Alignment-Free Sequence Comparison: A Review. Bioinformatics. 2003, 19 (4): 513-523.","journal-title":"Bioinformatics"},{"issue":"5","key":"1624_CR6","doi-asserted-by":"publisher","first-page":"465","DOI":"10.1016\/0005-1098(78)90005-5","volume":"14","author":"J Rissanen","year":"1978","unstructured":"Rissanen J: Modeling by shortest data description. Automatica. 1978, 14 (5): 465-471.","journal-title":"Automatica"},{"issue":"12","key":"1624_CR7","doi-asserted-by":"publisher","first-page":"3250","DOI":"10.1109\/TIT.2004.838101","volume":"50","author":"M Li","year":"2004","unstructured":"Li M, Chen X, Li X, Ma B, Vit\u00e1nyi PMB: The Similarity Metric. IEEE T. Inform. Theory. 2004, 50 (12): 3250-3264.","journal-title":"IEEE T. Inform. Theory"},{"key":"1624_CR8","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4757-2606-0","volume-title":"An Introduction to Kolmogorov Complexity and its Applications","author":"M Li","year":"1997","unstructured":"Li M, Vit\u00e1nyi PMB: An Introduction to Kolmogorov Complexity and its Applications. 1997, Springer-Verlag, 2","edition":"2"},{"issue":"4","key":"1624_CR9","doi-asserted-by":"publisher","first-page":"1523","DOI":"10.1109\/TIT.2005.844059","volume":"51","author":"R Cilibrasi","year":"2005","unstructured":"Cilibrasi R, Vit\u00e1nyi PMB: Clustering by Compression. IEEE T. Inform. Theory. 
2005, 51 (4): 1523-1545.","journal-title":"IEEE T. Inform. Theory"},{"key":"1624_CR10","first-page":"206","volume-title":"Proc. 10th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, ACM","author":"E Keogh","year":"2004","unstructured":"Keogh E, Lonardi S, Rtanamahata C: Towards parameter-free data mining. Proc. 10th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, ACM. 2004, 206-215."},{"key":"1624_CR11","first-page":"175","volume-title":"Proc. 11th ACM-SIAM Symp. Discrete Algorithms","author":"AL Buchsbaum","year":"2000","unstructured":"Buchsbaum AL, Caldwell DF, Church KW, Fowler GS, Muthukrishnan S: Engineering the Compression of Massive Tables: An Experimental Approach. Proc. 11th ACM-SIAM Symp. Discrete Algorithms. 2000, 175-184."},{"issue":"6","key":"1624_CR12","doi-asserted-by":"publisher","first-page":"825","DOI":"10.1145\/950620.950622","volume":"50","author":"AL Buchsbaum","year":"2003","unstructured":"Buchsbaum AL, Fowler GS, Giancarlo R: Improving Table Compression with Combinatorial Optimization. J ACM. 2003, 50 (6): 825-851.","journal-title":"J ACM"},{"issue":"7","key":"1624_CR13","doi-asserted-by":"publisher","first-page":"1015","DOI":"10.1093\/bioinformatics\/bth031","volume":"20","author":"N Krasnogor","year":"2004","unstructured":"Krasnogor N, Pelta DA: Measuring the Similarity of Protein Structures by Means of the Universal Similarity Metric. Bioinformatics. 2004, 20 (7): 1015-1021.","journal-title":"Bioinformatics"},{"key":"1624_CR14","first-page":"1124","volume-title":"Proc. 4th Conf. European Society for Fuzzy Logic and Technology and 11 Rencontres Francophones sur la Logique Floue et ses Applications (EUSFLAT-LFA, 2005)","author":"D Pelta","year":"2005","unstructured":"Pelta D, Gonzales JR, Krasnogor N: Protein Structure Comparison through Fuzzy Contact Maps and the Universal Similarity Metric. Proc. 4th Conf. 
European Society for Fuzzy Logic and Technology and 11 Rencontres Francophones sur la Logique Floue et ses Applications (EUSFLAT-LFA, 2005). 2005, 1124-1129."},{"key":"1624_CR15","first-page":"177","volume-title":"London Algorithmics and Stringology 2006","author":"D Gilbert","year":"2007","unstructured":"Gilbert D, Rossell\u00f3 F, Valiente G, Veeramalai M: Alignment-Free Comparison of TOPS Strings. London Algorithmics and Stringology 2006. Edited by: Daykin J, Mohamed M, Steinh\u00f6fel K. 2007, College Publications, 8: 177-197."},{"key":"1624_CR16","doi-asserted-by":"publisher","first-page":"115","DOI":"10.1007\/s00453-003-1045-2","volume":"38","author":"LP Chew","year":"2003","unstructured":"Chew LP, Kedem K: Finding the Consensus Shape for a Protein Family. Algorithmica. 2003, 38: 115-129.","journal-title":"Algorithmica"},{"issue":"3","key":"1624_CR17","doi-asserted-by":"publisher","first-page":"773","DOI":"10.1110\/ps.03328504","volume":"13","author":"M Sierk","year":"2004","unstructured":"Sierk M, Person W: Sensitivity and Selectivity in Protein Structure Comparison. Protein Sci. 2004, 13 (3): 773-785.","journal-title":"Protein Sci"},{"key":"1624_CR18","doi-asserted-by":"publisher","first-page":"12","DOI":"10.1186\/1472-6807-5-12","volume":"5","author":"B Thiruv","year":"2005","unstructured":"Thiruv B, Quon G, Saldanha SA, Steipe B: Nh3D: A Reference Dataset of Non-Homologous Protein Structures. BMC Struct Biol. 2005, 5: 12-","journal-title":"BMC Struct Biol"},{"issue":"4","key":"1624_CR19","doi-asserted-by":"publisher","first-page":"1711","DOI":"10.1006\/jmbi.1998.2400","volume":"285","author":"JM Word","year":"1999","unstructured":"Word JM, Lovell SC, LaBean TH, Taylor HC, Zalis ME, Presley BK, Richardson JS, Richardson DC: Visualizing and Quantifying Molecular Goodness-of-Fit: Small-Probe Contact Dots with Explicit Hydrogen Atoms. J Mol Biol. 
1999, 285 (4): 1711-1733.","journal-title":"J Mol Biol"},{"issue":"D","key":"1624_CR20","first-page":"D247","volume":"33","author":"F Pearl","year":"2005","unstructured":"Pearl F: The CATH Domain Structure Database and Related Resources Gene3D and DHS Provide Comprehensive Domain Family Information for Genome Analysis. Nucleic Acids Res. 2005, 33 (D): D247-D251.","journal-title":"Nucleic Acids Res"},{"issue":"8","key":"1624_CR21","doi-asserted-by":"publisher","first-page":"2444","DOI":"10.1073\/pnas.85.8.2444","volume":"85","author":"WR Pearson","year":"1998","unstructured":"Pearson WR, Lipman DJ: Improved Tools for Biological Sequence Comparison. Proc Natl Acad Sci USA. 1998, 85 (8): 2444-2448.","journal-title":"Proc Natl Acad Sci USA"},{"key":"1624_CR22","doi-asserted-by":"publisher","first-page":"31","DOI":"10.1093\/protein\/7.1.31","volume":"7","author":"TP Flores","year":"1994","unstructured":"Flores TP, Moss DM, Thornton JM: An Algorithm for Automatically Generating Protein Topology Cartoons. Protein Eng Des Sel. 1994, 7: 31-37.","journal-title":"Protein Eng Des Sel"},{"key":"1624_CR23","doi-asserted-by":"publisher","first-page":"317","DOI":"10.1093\/bioinformatics\/15.4.317","volume":"15","author":"DR Gilbert","year":"1999","unstructured":"Gilbert DR, Westhead DR, Nagano N, Thornton JM: Motif-Based Searching in TOPS Protein Topology Databases. Bioinformatics. 1999, 15: 317-326.","journal-title":"Bioinformatics"},{"key":"1624_CR24","doi-asserted-by":"publisher","first-page":"35","DOI":"10.1016\/S0968-0004(97)01161-4","volume":"23","author":"DR Westhead","year":"1998","unstructured":"Westhead DR, Hutton DC, Thornton JM: An Atlas of Protein Topology Cartoons Available on the World Wide Web. Trends Biochem Sci. 
1998, 23: 35-36.","journal-title":"Trends Biochem Sci"},{"issue":"4","key":"1624_CR25","doi-asserted-by":"publisher","first-page":"897","DOI":"10.1110\/ps.8.4.897","volume":"8","author":"DR Westhead","year":"1999","unstructured":"Westhead DR, Slidel T, Flores T, Thornton JM: Protein Structural Topology: Automated Analysis and Diagrammatic Representations. Protein Sci. 1999, 8 (4): 897-904.","journal-title":"Protein Sci"},{"issue":"12","key":"1624_CR26","doi-asserted-by":"publisher","first-page":"2577","DOI":"10.1002\/bip.360221211","volume":"22","author":"W Kabsch","year":"1983","unstructured":"Kabsch W, Sander C: Dictionary of Protein Secondary Structure: Pattern Recognition of Hydrogen-Bonded and Geometrical Features. Biopolymers. 1983, 22 (12): 2577-2637.","journal-title":"Biopolymers"},{"key":"1624_CR27","doi-asserted-by":"publisher","first-page":"63","DOI":"10.1016\/0167-4838(91)90093-F","volume":"1078","author":"F Mauri","year":"1991","unstructured":"Mauri F, Omnaas J, Davidson L, Whitfill C, Kitto GB: Amino acid sequence of a globin from the sea cucumber Caudina (Molpadia) arenicola. Biochimica et Biophysica Acta. 1991, 1078: 63-67.","journal-title":"Biochimica et Biophysica Acta"},{"key":"1624_CR28","doi-asserted-by":"publisher","first-page":"29","DOI":"10.1007\/BF01025089","volume":"11","author":"GD McDonald","year":"1992","unstructured":"McDonald GD, Davidson L, Kitto GB: Amino acid sequence of the coelomic C globin from the sea cucumber Caudina (Molpadia) arenicola. J Protein Chem. 1992, 11: 29-37.","journal-title":"J Protein Chem"},{"key":"1624_CR29","doi-asserted-by":"publisher","first-page":"4","DOI":"10.1186\/1748-7188-1-4","volume":"1","author":"A Apostolico","year":"2006","unstructured":"Apostolico A, Comin M, Parida L: Mining, Compressing and Classifying with Extensible Motifs. Algorithms Mol Biol. 
2006, 1: 4-","journal-title":"Algorithms Mol Biol"},{"issue":"20","key":"1624_CR30","doi-asserted-by":"publisher","first-page":"3940","DOI":"10.1093\/bioinformatics\/bti623","volume":"21","author":"T Sing","year":"2005","unstructured":"Sing T, Sander O, Beerenwinkel N, Lengauer T: ROCR: Visualizing Classifier Performance in R. Bioinformatics. 2005, 21 (20): 3940-3941.","journal-title":"Bioinformatics"},{"key":"1624_CR31","volume-title":"Numerical Taxonomy: The Principles and Practice of Numerical Classification","author":"PHA Sneath","year":"1973","unstructured":"Sneath PHA, Sokal RR: Numerical Taxonomy: The Principles and Practice of Numerical Classification. 1973, San Francisco: W. H. Freeman"},{"issue":"4","key":"1624_CR32","first-page":"406","volume":"4","author":"N Saitou","year":"1987","unstructured":"Saitou N, Nei M: The Neighbor-Joining Method: A New Method for Reconstructing Phylogenetic Trees. Mol Biol Evol. 1987, 4 (4): 406-425.","journal-title":"Mol Biol Evol"},{"issue":"10","key":"1624_CR33","doi-asserted-by":"publisher","first-page":"1611","DOI":"10.1101\/gr.361602","volume":"12","author":"JE Stajich","year":"2002","unstructured":"Stajich JE: The BioPerl Toolkit: Perl Modules for the Life Sciences. Genome Res. 2002, 12 (10): 1611-1618. [http:\/\/www.bioperl.org]","journal-title":"Genome Res"},{"issue":"15","key":"1624_CR34","doi-asserted-by":"publisher","first-page":"3201","DOI":"10.1093\/bioinformatics\/bti517","volume":"21","author":"J Handl","year":"2005","unstructured":"Handl J, Knowles J, Kell DB: Computational Cluster Validation in Post-Genomic Data Analysis. Bioinformatics. 2005, 21 (15): 3201-3212.","journal-title":"Bioinformatics"},{"issue":"4","key":"1624_CR35","first-page":"536","volume":"247","author":"AG Murzin","year":"1995","unstructured":"Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: A Structural Classification of Proteins Database for the Investigation of Sequences and Structures. J Mol Biol. 
1995, 247 (4): 536-540.","journal-title":"J Mol Biol"},{"issue":"10","key":"1624_CR36","doi-asserted-by":"publisher","first-page":"2150","DOI":"10.1110\/ps.0306803","volume":"12","author":"R Day","year":"2003","unstructured":"Day R, Beck DA, Armen RS, Daggett V: A consensus view of fold space: Combining SCOP, CATH, and the Dali Domain Dictionary. Protein Sci. 2003, 12 (10): 2150-2160.","journal-title":"Protein Sci"},{"issue":"9","key":"1624_CR37","doi-asserted-by":"publisher","first-page":"1099","DOI":"10.1016\/S0969-2126(99)80177-4","volume":"7","author":"C Hadley","year":"1999","unstructured":"Hadley C, Jones DT: A Systematic Comparison of Protein Structure Classifications: SCOP, CATH and FSSP. Structure. 1999, 7 (9): 1099-1112.","journal-title":"Structure"},{"key":"1624_CR38","doi-asserted-by":"publisher","first-page":"10","DOI":"10.1093\/nar\/28.1.10","volume":"28","author":"DL Wheeler","year":"2000","unstructured":"Wheeler DL, Chappey C, Lash AE, Leipe DD, Madden TL, Schuler GD, Tatusova TA, Rapp BA: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2000, 28: 10-14. [http:\/\/www.ncbi.nlm.nih.gov\/Taxonomy\/]","journal-title":"Nucleic Acids Res"},{"issue":"5765","key":"1624_CR39","doi-asserted-by":"publisher","first-page":"1283","DOI":"10.1126\/science.1123061","volume":"311","author":"FD Ciccarelli","year":"2006","unstructured":"Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P: Toward Automatic Reconstruction of a Highly Resolved Tree of Life. Science. 2006, 311 (5765): 1283-1287.","journal-title":"Science"},{"key":"1624_CR40","first-page":"1","volume":"1","author":"AN Kolmogorov","year":"1965","unstructured":"Kolmogorov AN: Three Approaches to the Quantitative Definition of Information. Probl Inform Transm. 
1965, 1: 1-7.","journal-title":"Probl Inform Transm"},{"issue":"7","key":"1624_CR41","doi-asserted-by":"publisher","first-page":"1407","DOI":"10.1109\/18.681318","volume":"44","author":"CH Bennett","year":"1998","unstructured":"Bennett CH, G\u00e1cs P, Li M, Vit\u00e1nyi PMB, Zurek W: Information Distance. IEEE T. Inform. Theory. 1998, 44 (7): 1407-1423.","journal-title":"IEEE T. Inform. Theory"},{"key":"1624_CR42","volume-title":"Elements of Information Theory","author":"TM Cover","year":"1990","unstructured":"Cover TM, Thomas JA: Elements of Information Theory. 1990, Wiley"},{"issue":"3","key":"1624_CR43","doi-asserted-by":"publisher","first-page":"337","DOI":"10.1109\/TIT.1977.1055714","volume":"23","author":"J Ziv","year":"1977","unstructured":"Ziv J, Lempel A: A universal algorithm for sequential data compression. IEEE T. Inform. Theory. 1977, 23 (3): 337-343.","journal-title":"IEEE T. Inform. Theory"},{"key":"1624_CR44","volume-title":"Tech. Rep. 124, Digital Equipment Corporation","author":"M Burrows","year":"1994","unstructured":"Burrows M, Wheeler D: A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation. 1994"},{"key":"1624_CR45","first-page":"202","volume-title":"IEEE Data Compression Conference, IEEE","author":"D Shkarin","year":"2002","unstructured":"Shkarin D: PPM: One step to practicality. IEEE Data Compression Conference, IEEE. 2002, 202-211."},{"key":"1624_CR46","volume-title":"PPMd Compressor Ver. J","author":"D Shkarin","year":"2006","unstructured":"Shkarin D: PPMd Compressor Ver. J. 2006, [http:\/\/www.compression.ru\/ds\/]"},{"issue":"6","key":"1624_CR47","doi-asserted-by":"publisher","first-page":"520","DOI":"10.1145\/214762.214771","volume":"30","author":"IH Witten","year":"1987","unstructured":"Witten IH, Neal RM, Cleary JG: Arithmetic coding for data compression. Commun ACM. 
1987, 30 (6): 520-540.","journal-title":"Commun ACM"},{"key":"1624_CR48","volume-title":"Carryless Range Coding","author":"M Lundqvist","year":"2006","unstructured":"Lundqvist M: Carryless Range Coding. 2006, [http:\/\/mikaellq.net\/software.htm]"},{"key":"1624_CR49","first-page":"561","volume-title":"Proc. 33rd Int. Coll. Automata, Languages and Programming, of Lecture Notes in Computer Science","author":"P Ferragina","year":"2006","unstructured":"Ferragina P, Giancarlo R, Manzini G: The Myriad Virtues of Wavelet Trees. Proc. 33rd Int. Coll. Automata, Languages and Programming, of Lecture Notes in Computer Science. 2006, Berlin: Springer-Verlag, 4051: 561-572."},{"key":"1624_CR50","first-page":"841","volume-title":"Proc. 14th Annual ACM-SIAM Symp. Discrete Algorithms, ACM","author":"R Grossi","year":"2003","unstructured":"Grossi R, Gupta A, Vitter J: High-Order Entropy-Compressed Text Indexes. Proc. 14th Annual ACM-SIAM Symp. Discrete Algorithms, ACM. 2003, 841-850."},{"issue":"4","key":"1624_CR51","doi-asserted-by":"publisher","first-page":"61","DOI":"10.1109\/51.940049","volume":"20","author":"X Chen","year":"2001","unstructured":"Chen X, Kwong S, Li M: A compression algorithm for DNA sequences. IEEE Engineering in Medicine and Biology Magazine. 2001, 20 (4): 61-66.","journal-title":"IEEE Engineering in Medicine and Biology Magazine"},{"issue":"3","key":"1624_CR52","doi-asserted-by":"publisher","first-page":"443","DOI":"10.1016\/0022-2836(70)90057-4","volume":"48","author":"S Needleman","year":"1970","unstructured":"Needleman S, Wunsch C: A General Method applicable to the Search for Similarities in the Amino Acid Sequence of two Proteins. J Mol Biol. 
1970, 48 (3): 443-453.","journal-title":"J Mol Biol"},{"issue":"3","key":"1624_CR53","doi-asserted-by":"publisher","first-page":"403","DOI":"10.1016\/S0022-2836(05)80360-2","volume":"215","author":"SF Altschul","year":"1990","unstructured":"Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic Local Alignment Search Tool. J Mol Biol. 1990, 215 (3): 403-410.","journal-title":"J Mol Biol"},{"key":"1624_CR54","doi-asserted-by":"publisher","first-page":"195","DOI":"10.1016\/0022-2836(81)90087-5","volume":"147","author":"TF Smith","year":"1981","unstructured":"Smith TF, Waterman MS: Identification of Common Molecular Subsequences. J Mol Biol. 1981, 147: 195-197.","journal-title":"J Mol Biol"},{"issue":"22","key":"1624_CR55","doi-asserted-by":"publisher","first-page":"10915","DOI":"10.1073\/pnas.89.22.10915","volume":"89","author":"S Henikoff","year":"1992","unstructured":"Henikoff S, Henikoff JG: Amino Acid Substitution Matrices from Protein Blocks. Proc Natl Acad Sci USA. 1992, 89 (22): 10915-10919.","journal-title":"Proc Natl Acad Sci USA"},{"key":"1624_CR56","volume-title":"pairseqsim: Pairwise Sequence Alignment and Scoring Algorithms for Global, Local and Overlap Alignment with Affine Gap Penalty","author":"W Wolski","year":"2007","unstructured":"Wolski W: pairseqsim: Pairwise Sequence Alignment and Scoring Algorithms for Global, Local and Overlap Alignment with Affine Gap Penalty. 2007, [http:\/\/www.bioconductor.org]"},{"key":"1624_CR57","volume-title":"Structural Approaches to Sequence Evolution: Molecules, Networks, Populations, Biological and Medical Physics, Biomedical Engineering","author":"D Charif","year":"2007","unstructured":"Charif D, Lobry JR: SeqinR 1.0\u20132: A Contributed Package to the R Project for Statistical Computing Devoted to Biological Sequences Retrieval and Analysis. Structural Approaches to Sequence Evolution: Molecules, Networks, Populations, Biological and Medical Physics, Biomedical Engineering. 
Edited by: Bastolla U, Porto M, Roman HE, Vendruscolo M. 2007, New York: Springer-Verlag"},{"key":"1624_CR58","doi-asserted-by":"publisher","first-page":"404","DOI":"10.1016\/j.jbi.2005.02.008","volume":"38","author":"TA Lasko","year":"2005","unstructured":"Lasko TA, Bhagwat JG, Zou KH, Ohno-Machado L: The Use of Receiver Operating Characteristic Curves in Biomedical Informatics. J Biomed Inform. 2005, 38: 404-415.","journal-title":"J Biomed Inform"},{"key":"1624_CR59","volume-title":"Information Retrieval","author":"CJ van Rijsbergen","year":"1979","unstructured":"van Rijsbergen CJ: Information Retrieval. 1979, London: Butterworths, 2","edition":"2"},{"key":"1624_CR60","doi-asserted-by":"publisher","first-page":"75","DOI":"10.2307\/2413347","volume":"34","author":"D Penny","year":"1985","unstructured":"Penny D, Hendy MD: The Use of Tree Comparison Metrics. Syst Zool. 1985, 34: 75-82.","journal-title":"Syst Zool"},{"key":"1624_CR61","first-page":"119","volume-title":"Proc. 6th Australian Conf. Combinatorial Mathematics, of Lecture Notes Mathematics","author":"DF Robinson","year":"1979","unstructured":"Robinson DF, Foulds LR: Comparison of Weighted Labelled Trees. Proc. 6th Australian Conf. Combinatorial Mathematics, of Lecture Notes Mathematics. 
1979, Berlin: Springer-Verlag, 748: 119-126."}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/1471-2105-8-252.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,9,1]],"date-time":"2021-09-01T01:49:29Z","timestamp":1630460969000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/1471-2105-8-252"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2007,7,13]]},"references-count":61,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2007,12]]}},"alternative-id":["1624"],"URL":"https:\/\/doi.org\/10.1186\/1471-2105-8-252","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2007,7,13]]},"assertion":[{"value":"23 May 2007","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"13 July 2007","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"13 July 2007","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"252"}}