{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,2]],"date-time":"2026-02-02T20:18:50Z","timestamp":1770063530079,"version":"3.49.0"},"reference-count":40,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,3,7]],"date-time":"2025-03-07T00:00:00Z","timestamp":1741305600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,3,7]],"date-time":"2025-03-07T00:00:00Z","timestamp":1741305600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100002386","name":"Cairo University","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100002386","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Big Data"],"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>The ability to measure similarity or distance between data points is critical for various analytical tasks, including classification, clustering, and anomaly detection. However, traditional distance metrics such as Euclidean, Manhattan, and Hamming often struggle with mixed data types, varying attribute scales, and noise, limiting their robustness in diverse datasets. This paper introduces the Standard Deviation Score (SD-score), a novel similarity metric designed to address these challenges. By transforming traditional distance values into standard deviation units relative to a target point, the SD-score enables robust and interpretable similarity assessments. Extensive experimental evaluations demonstrate that the SD-score consistently outperforms conventional metrics in accuracy, precision, recall, and F-score within the k-Nearest Neighbors classification framework. Also, a comprehensive evaluation of the SD-score\u2019s performance across Gaussian, skewed, and multimodal distributions showed promising results in the cluster coherence experiment, in which the Silhouette score was measured through the K-means clustering algorithm, emphasizing its adaptability to real-world data complexities. Additionally, the experiments detail improved handling of mixed numerical, ordinal, and categorical data types through a unified framework. The proposed metric incorporates inherent normalization mechanisms, reducing sensitivity to outliers and ensuring consistency across varying data scales and distributions, making it a versatile tool for real-world applications. This advancement in similarity measurement paves the way for more accurate and efficient data analysis across multiple domains.<\/jats:p>","DOI":"10.1186\/s40537-025-01091-z","type":"journal-article","created":{"date-parts":[[2025,3,7]],"date-time":"2025-03-07T07:06:34Z","timestamp":1741331194000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":8,"title":["The Standard Deviation Score: a novel similarity metric for data analysis"],"prefix":"10.1186","volume":"12","author":[{"given":"Osama","family":"Ismael","sequence":"first","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,3,7]]},"reference":[{"key":"1091_CR1","doi-asserted-by":"publisher","DOI":"10.1016\/j.csda.2021.107190.S2CID219260336","volume":"158","author":"F Batool","year":"2021","unstructured":"Batool F, Hennig C. Clustering with the average silhouette width. Comput Stat Data Anal. 2021;158: 107190. https:\/\/doi.org\/10.1016\/j.csda.2021.107190.S2CID219260336.","journal-title":"Comput Stat Data Anal"},{"key":"1091_CR2","doi-asserted-by":"publisher","unstructured":"Biebler K, Wodny M, Jager B. Data mining and metrics on data sets. In: International conference on computational intelligence for modelling, control and automation and international conference on intelligent agents, web technologies and internet commerce (CIMCA-IAWTIC\u201906), Vienna, Austria. p. 638\u201341, https:\/\/doi.org\/10.1109\/CIMCA.2005.1631335.","DOI":"10.1109\/CIMCA.2005.1631335"},{"key":"1091_CR3","unstructured":"Britannica. The Editors of Encyclopaedia. \u201cBinomial coefficients\u201d. Encyclopedia Britannica; 2024. https:\/\/www.britannica.com\/science\/binomial-coefficient. Accessed 10 June 2024."},{"key":"1091_CR4","doi-asserted-by":"publisher","first-page":"314","DOI":"10.1016\/j.neucom.2016.04.073","volume":"249","author":"W Chang","year":"2017","unstructured":"Chang W, Pang L, Tay K. Application of self-organizing map to failure modes and effects analysis methodology. Neurocomputing. 2017;249:314\u201320 (ISSN 0925-2312).","journal-title":"Neurocomputing"},{"issue":"24\u201325","key":"1091_CR5","doi-asserted-by":"publisher","first-page":"2365","DOI":"10.1016\/j.tcs.2009.02.023","volume":"410","author":"S Chen","year":"2009","unstructured":"Chen S, Ma B, Zhang K. On the similarity metric and the distance metric. Theoret Comput Sci. 2009;410(24\u201325):2365\u201376.","journal-title":"Theoret Comput Sci"},{"issue":"1","key":"1091_CR6","first-page":"43","volume":"8","author":"S Choi","year":"2010","unstructured":"Choi S, Cha S, Tappert C. A survey of binary similarity and distance measures. J Syst Cybern Inform. 2010;8(1):43\u20138.","journal-title":"J Syst Cybern Inform"},{"issue":"5\u20136","key":"1091_CR7","doi-asserted-by":"publisher","first-page":"225","DOI":"10.1016\/j.drudis.2007.01.011","volume":"12","author":"H Eckert","year":"2007","unstructured":"Eckert H, Bajorath J. Molecular similarity analysis in virtual screening: foundations, limitations and novel approaches. Drug Discov Today. 2007;12(5\u20136):225\u201333 (ISSN 1359-6446).","journal-title":"Drug Discov Today"},{"issue":"11","key":"1091_CR8","doi-asserted-by":"publisher","first-page":"2372","DOI":"10.1021\/ac970763d","volume":"70","author":"J Egan","year":"1998","unstructured":"Egan J, Morgan L. Outlier detection in multivariate analytical chemical data. J Anal Chem. 1998;70(11):2372\u20139. https:\/\/doi.org\/10.1021\/ac970763d.","journal-title":"J Anal Chem"},{"key":"1091_CR9","first-page":"80","volume-title":"A logical introduction to probability and induction","author":"H Franz","year":"2018","unstructured":"Franz H. A logical introduction to probability and induction. New York: Oxford University Press; 2018. p. 80 (ISBN\u00a09780190845414)."},{"issue":"2","key":"1091_CR10","doi-asserted-by":"publisher","first-page":"211","DOI":"10.1111\/j.2517-6161.1964.tb00553.x","volume":"26","author":"B George","year":"1964","unstructured":"George B, Cox D. An analysis of transformations. J R Stat Soc Ser B. 1964;26(2):211\u201352.","journal-title":"J R Stat Soc Ser B"},{"key":"1091_CR11","doi-asserted-by":"publisher","first-page":"121","DOI":"10.1016\/j.procs.2022.12.014","volume":"215","author":"T Ghongade","year":"2022","unstructured":"Ghongade T, Khobragade R. Pragmatic evaluation of data mining models based on quality assessment & metric analysis. Procedia Comput Sci. 2022;215:121\u201330 (ISSN 1877-0509).","journal-title":"Procedia Comput Sci"},{"issue":"4","key":"1091_CR12","doi-asserted-by":"publisher","first-page":"1713","DOI":"10.1021\/ci060013h","volume":"46","author":"R Guha","year":"2006","unstructured":"Guha R, Dutta D, Jurs P, Chen T. R-NN curves: an intuitive approach to outlier detection using a distance based method. J Chem Inf Model. 2006;46(4):1713\u201322.","journal-title":"J Chem Inf Model"},{"key":"1091_CR13","volume-title":"Data mining: concepts and techniques","author":"J Han","year":"2022","unstructured":"Han J, Pei J, Tong H. Data mining: concepts and techniques. 4th ed. Morgan Kaufmann Publishers, Elsevier; 2022.","edition":"4"},{"key":"1091_CR14","doi-asserted-by":"publisher","first-page":"357","DOI":"10.1038\/s41586-020-2649-2","volume":"585","author":"R Harris","year":"2020","unstructured":"Harris R, Millman J, van der Walt J, et al. Array programming with NumPy. Nature. 2020;585:357\u201362. https:\/\/doi.org\/10.1038\/s41586-020-2649-2.","journal-title":"Nature"},{"key":"1091_CR15","unstructured":"Haug M. Measure of association. Encyclopedia Britannica; 2023. https:\/\/www.britannica.com\/topic\/measure-of-association."},{"key":"1091_CR16","doi-asserted-by":"publisher","first-page":"01","DOI":"10.5121\/ijdkp.2015.5201","volume":"5","author":"M Hossin","year":"2015","unstructured":"Hossin M, Sulaiman M. A review on evaluation metrics for data classification evaluations. Int J Data Min Knowl Manag Process. 2015;5:01\u201311.","journal-title":"Int J Data Min Knowl Manag Process"},{"key":"1091_CR17","doi-asserted-by":"publisher","first-page":"178","DOI":"10.1016\/j.ins.2022.11.139","volume":"622","author":"A Ikotun","year":"2023","unstructured":"Ikotun A, Ezugwu A, Abualigah L, Abuhaija B, Heming J. K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data. Inf Sci. 2023;622:178\u2013210 (ISSN 0020-0255).","journal-title":"Inf Sci"},{"key":"1091_CR18","doi-asserted-by":"publisher","DOI":"10.1201\/9780429341830","volume-title":"Introduction to data science: data analysis and prediction algorithms with R. Chapter 9: visualizing data distributions","author":"A Irizarry","year":"2019","unstructured":"Irizarry A. Introduction to data science: data analysis and prediction algorithms with R. Chapter 9: visualizing data distributions. 1st ed. Chapman and Hall\/CRC; 2019. https:\/\/doi.org\/10.1201\/9780429341830.","edition":"1"},{"issue":"5","key":"1091_CR19","doi-asserted-by":"publisher","first-page":"853","DOI":"10.1109\/TKDE.2018.2848902","volume":"31","author":"S Jian","year":"2019","unstructured":"Jian S, Pang G, Cao L, Lu K, Gao H. CURE: flexible categorical data representation by hierarchical coupling learning. IEEE Trans Knowl Data Eng. 2019;31(5):853\u201366. https:\/\/doi.org\/10.1109\/TKDE.2018.2848902.","journal-title":"IEEE Trans Knowl Data Eng"},{"key":"1091_CR20","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-10531-0","volume-title":"Statistics for data scientists: an introduction to probability, statistics, and data analysis","author":"M Kaptein","year":"2022","unstructured":"Kaptein M, Heuvel E. Statistics for data scientists: an introduction to probability, statistics, and data analysis. Springer Nature; 2022."},{"key":"1091_CR21","unstructured":"Knorr E, Ng R. Finding intensional knowledge of distance-based outliers. In: Proceedings of the 25th international conference on very large data bases; 1999. p. 211\u201322."},{"issue":"3\u20134","key":"1091_CR22","doi-asserted-by":"publisher","first-page":"237","DOI":"10.1007\/s007780050006","volume":"8","author":"E Knorr","year":"2000","unstructured":"Knorr E, Ng R, Tucakov V. Distance-based outliers: algorithms and applications. VLDB J Int J Very Large Data Bases. 2000;8(3\u20134):237\u201353. https:\/\/doi.org\/10.1007\/s007780050006.","journal-title":"VLDB J Int J Very Large Data Bases"},{"issue":"7","key":"1091_CR23","doi-asserted-by":"publisher","first-page":"471","DOI":"10.1016\/S1359-6446(05)03419-7","volume":"10","author":"M Koch","year":"2005","unstructured":"Koch M, Waldmann H. Protein structure similarity clustering and natural product structure as guiding principles in drug discovery. Drug Discov Today. 2005;10(7):471\u201383.","journal-title":"Drug Discov Today"},{"key":"1091_CR24","first-page":"537","volume-title":"The Handbook of Brain Theory and Neural Networks","author":"T Kohonen","year":"1995","unstructured":"Kohonen T. Learning vector quantization. In: Arbib MA, editor. The Handbook of Brain Theory and Neural Networks. Cambridge: MIT Press; 1995. p. 537\u201340."},{"key":"1091_CR25","doi-asserted-by":"publisher","DOI":"10.1016\/j.compag.2020.105507","volume":"174","author":"M Koklu","year":"2020","unstructured":"Koklu M, Ozkan I. Multiclass classification of dry beans using computer vision and machine learning techniques. Comput Electr Agric. 2020;174: 105507 (ISSN 0168-1699).","journal-title":"Comput Electr Agric"},{"key":"1091_CR26","unstructured":"Mehta R, Chormunge S. Pattern matching algorithms: a survey. In: Poonia R, Singh V, Singh Jat D, Div\u00e1n M, Khan M, editors. Proceedings of third international conference on sustainable computing. Advances in intelligent systems and computing, vol 1404. Singapore: Springer; 2022."},{"key":"1091_CR27","doi-asserted-by":"publisher","DOI":"10.1007\/978-0-387-88615-2_4","volume-title":"Data mining in agriculture. Springer optimization and its applications","author":"A Mucherino","year":"2009","unstructured":"Mucherino A, Papajorgji P, Pardalos P. k-Nearest Neighbor classification. In: Data mining in agriculture. Springer optimization and its applications, vol. 34. New York: Springer; 2009. https:\/\/doi.org\/10.1007\/978-0-387-88615-2_4."},{"key":"1091_CR28","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4614-7163-9_141-1","volume-title":"Encyclopedia of social network analysis and mining","author":"M Mulekar","year":"2017","unstructured":"Mulekar M, Brown C. Distance and similarity measures. In: Alhajj R, Rokne J, editors. Encyclopedia of social network analysis and mining. New York: Springer; 2017. https:\/\/doi.org\/10.1007\/978-1-4614-7163-9_141-1."},{"key":"1091_CR29","unstructured":"Rosa T, Primartha R, Wijaya A. Comparison of distance measurement methods on k-nearest neighbor algorithm for classification. In: Sriwijaya international conference on information technology and its applications (SICONIAN), advances in intelligent systems research, vol 172; 2019."},{"key":"1091_CR30","doi-asserted-by":"publisher","first-page":"459","DOI":"10.1016\/B978-0-12-374423-4.00013-6","volume-title":"Chapter 13\u2014application support issues, networked graphics","author":"A Steed","year":"2010","unstructured":"Steed A, Oliveira M. Chapter 13\u2014application support issues, networked graphics. Morgan Kaufmann; 2010. p. 459\u201388. https:\/\/doi.org\/10.1016\/B978-0-12-374423-4.00013-6. (ISBN 9780123744234)."},{"key":"1091_CR31","volume-title":"Book: encyclopedia of analytical chemistry","author":"R Todeschini","year":"2015","unstructured":"Todeschini R, Ballabio D, Consonni V. Distances and other dissimilarity measures in chemometrics. In: Book: encyclopedia of analytical chemistry. John Wiley & Sons Ltd; 2015."},{"key":"1091_CR32","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1016\/j.aca.2013.04.034","volume":"787","author":"R Todeschini","year":"2013","unstructured":"Todeschini R, Ballabio D, Consonni V, Sahigara F, Filzmoser P. Locally centred Mahalanobis distance: a new distance measure with salient features towards outlier detection. Anal Chim Acta. 2013;787:1\u20139. https:\/\/doi.org\/10.1016\/j.aca.2013.04.034. (ISSN 0003-2670).","journal-title":"Anal Chim Acta"},{"issue":"11","key":"1091_CR33","doi-asserted-by":"publisher","first-page":"2884","DOI":"10.1021\/ci300261r","volume":"52","author":"R Todeschini","year":"2012","unstructured":"Todeschini R, Consonni V, Xiang H, Holliday J, Buscema M, Willett P. Similarity coefficients for binary chemoinformatics data: overview and extended comparison using simulated and real data sets. J Chem Inf Model. 2012;52(11):2884\u2013901. https:\/\/doi.org\/10.1021\/ci300261r.","journal-title":"J Chem Inf Model"},{"issue":"3","key":"1091_CR34","doi-asserted-by":"publisher","first-page":"162","DOI":"10.1177\/00045632211050531","volume":"59","author":"RM West","year":"2022","unstructured":"West RM. Best practice in statistics: The use of log transformation. Ann Clin Biochem. 2022;59(3):162\u20135. https:\/\/doi.org\/10.1177\/00045632211050531.","journal-title":"Ann Clin Biochem"},{"issue":"6","key":"1091_CR35","doi-asserted-by":"publisher","first-page":"983","DOI":"10.1021\/ci9800211","volume":"38","author":"P Willett","year":"1998","unstructured":"Willett P, Barnard J, Downs G. Chemical similarity searching. J Chem Inf Comput Sci. 1998;38(6):983\u201396.","journal-title":"J Chem Inf Comput Sci"},{"issue":"2","key":"1091_CR36","doi-asserted-by":"publisher","first-page":"758","DOI":"10.1109\/TCYB.2020.2983073","volume":"52","author":"Y Zhang","year":"2023","unstructured":"Zhang Y, Cheung Y. A New distance metric exploiting heterogeneous interattribute relationship for ordinal-and-nominal-attribute data clustering. IEEE Trans Cybern. 2023;52(2):758\u201371. https:\/\/doi.org\/10.1109\/TCYB.2020.2983073.","journal-title":"IEEE Trans Cybern"},{"issue":"9","key":"1091_CR37","doi-asserted-by":"publisher","first-page":"6530","DOI":"10.1109\/TNNLS.2022.3202700","volume":"34","author":"Y Zhang","year":"2022","unstructured":"Zhang Y, Cheung Y. Graph-based dissimilarity measurement for cluster analysis of any-type-attributed data. IEEE Trans Neural Netw Learn Syst. 2022;34(9):6530\u201344. https:\/\/doi.org\/10.1109\/TNNLS.2022.3202700.","journal-title":"IEEE Trans Neural Netw Learn Syst"},{"issue":"1","key":"1091_CR38","doi-asserted-by":"publisher","first-page":"39","DOI":"10.1109\/TNNLS.2019.2899381","volume":"31","author":"Y Zhang","year":"2020","unstructured":"Zhang Y, Cheung Y, Tan K. A unified entropy-based distance metric for ordinal-and-nominal-attribute data clustering. IEEE Trans Neural Netw Learn Syst. 2020;31(1):39\u201352. https:\/\/doi.org\/10.1109\/TNNLS.2019.2899381.","journal-title":"IEEE Trans Neural Netw Learn Syst"},{"key":"1091_CR39","doi-asserted-by":"publisher","unstructured":"Zhang Y, Cheung Y, Zeng A. Het2Hom: representation of heterogeneous attributes into homogeneous concept spaces for categorical-and-numerical-attribute data clustering. In: Proceedings of the 31st international joint conference on artificial intelligence; 2022. p. 3758\u201365. https:\/\/doi.org\/10.24963\/ijcai.2022\/522","DOI":"10.24963\/ijcai.2022\/522"},{"issue":"7","key":"1091_CR40","doi-asserted-by":"publisher","first-page":"1254","DOI":"10.1109\/TKDE.2018.2791525","volume":"30","author":"C Zhu","year":"2018","unstructured":"Zhu C, Cao L, Liu Q, Yin J, Kumar V. Heterogeneous metric learning of categorical data with hierarchical couplings. IEEE Trans Knowl Data Eng. 2018;30(7):1254\u201367. https:\/\/doi.org\/10.1109\/TKDE.2018.2791525.","journal-title":"IEEE Trans Knowl Data Eng"}],"container-title":["Journal of Big Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-025-01091-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s40537-025-01091-z\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-025-01091-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,7]],"date-time":"2025-03-07T07:06:42Z","timestamp":1741331202000},"score":1,"resource":{"primary":{"URL":"https:\/\/journalofbigdata.springeropen.com\/articles\/10.1186\/s40537-025-01091-z"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,3,7]]},"references-count":40,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["1091"],"URL":"https:\/\/doi.org\/10.1186\/s40537-025-01091-z","relation":{},"ISSN":["2196-1115"],"issn-type":[{"value":"2196-1115","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,3,7]]},"assertion":[{"value":"13 June 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 February 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"7 March 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare no competing interests.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"58"}}