{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T14:12:47Z","timestamp":1740147167021,"version":"3.37.3"},"reference-count":25,"publisher":"Springer Science and Business Media LLC","issue":"3","license":[{"start":{"date-parts":[[2021,7,3]],"date-time":"2021-07-03T00:00:00Z","timestamp":1625270400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,7,3]],"date-time":"2021-07-03T00:00:00Z","timestamp":1625270400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100003006","name":"ETH Zurich","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100003006","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Adv Data Anal Classif"],"published-print":{"date-parts":[[2022,9]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>In this paper, we describe the <jats:italic>fingerprint<\/jats:italic> method, a technique to classify bags of mixed-type measurements. The method was designed to solve a real-world industrial problem: classifying industrial plants (individuals at a higher level of organisation) starting from the measurements collected from their production lines (individuals at a lower level of organisation). In this specific application, the categorical information attached to the numerical measurements induced simple mixture-like structures on the global multivariate distributions associated with different classes. The <jats:italic>fingerprint<\/jats:italic> method is designed to compare the mixture components of a given test bag with the corresponding mixture components associated with the different classes, identifying the most similar generating distribution. 
When compared to other classification algorithms applied to several synthetic data sets and the original industrial data set, the proposed classifier showed remarkable improvements in performance.<\/jats:p>","DOI":"10.1007\/s11634-021-00452-9","type":"journal-article","created":{"date-parts":[[2021,7,3]],"date-time":"2021-07-03T03:35:27Z","timestamp":1625283327000},"page":"617-657","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["A fingerprint of a heterogeneous data set"],"prefix":"10.1007","volume":"16","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1284-4214","authenticated-orcid":false,"given":"Matteo","family":"Spallanzani","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2495-1623","authenticated-orcid":false,"given":"Gueorgui","family":"Mihaylov","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7327-3347","authenticated-orcid":false,"given":"Marco","family":"Prato","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3989-4887","authenticated-orcid":false,"given":"Roberto","family":"Fontana","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2021,7,3]]},"reference":[{"key":"452_CR1","doi-asserted-by":"crossref","unstructured":"Abdullin A, Nasraoui O (2012) Clustering heterogeneous data sets. 
In: Proceedings of the 2012 Eighth Latin American Web Congress, IEEE","DOI":"10.1109\/LA-WEB.2012.27"},{"key":"452_CR2","doi-asserted-by":"publisher","first-page":"503","DOI":"10.1016\/j.datak.2007.03.016","volume":"63","author":"A Ahmad","year":"2007","unstructured":"Ahmad A, Dey L (2007) A $$k$$-mean clustering algorithm for mixed numeric and categorical data. Data Knowl Eng 63:503\u2013527","journal-title":"Data Knowl Eng"},{"key":"452_CR3","unstructured":"Andrews S, Tsochantaridis I, Hofmann T (2003) Support vector machines for multiple-instance learning. In: Advances in Neural Information Processing Systems 15, Neural Information Processing Systems (NIPS)"},{"key":"452_CR4","unstructured":"Biernacki C, Deregnaucourt T, Kubicki V (2015) Model-based clustering with mixed\/missing data using the new software MixtComp. In: 8th International Conference of the ERCIM WG on Computational and Methodological Statistics (CMStatistics), ERCIM"},{"key":"452_CR5","first-page":"319","volume":"1","author":"WD Blizard","year":"1991","unstructured":"Blizard WD (1991) The development of multiset theory. Modern Logic 1:319\u2013352","journal-title":"Modern Logic"},{"key":"452_CR6","doi-asserted-by":"crossref","unstructured":"Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: 5th Annual Workshop on Computational Learning Theory, ACM","DOI":"10.1145\/130385.130401"},{"key":"452_CR7","doi-asserted-by":"publisher","first-page":"5","DOI":"10.1023\/A:1010933404324","volume":"45","author":"L Breiman","year":"2001","unstructured":"Breiman L (2001) Random forests. 
Mach Learn 45:5\u201332","journal-title":"Mach Learn"},{"key":"452_CR8","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511801389","volume-title":"An introduction to support vector machines and other Kernel-based learning methods","author":"N Cristianini","year":"2000","unstructured":"Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines and other Kernel-based learning methods. Cambridge University Press, Cambridge"},{"key":"452_CR9","doi-asserted-by":"publisher","first-page":"31","DOI":"10.1016\/S0004-3702(96)00034-3","volume":"89","author":"TG Dietterich","year":"1997","unstructured":"Dietterich TG, Lathrop RH, Lozano-P\u00e9rez T (1997) Solving the multiple instance problem with axis-parallel rectangles. Artif Intell 89:31\u201371","journal-title":"Artif Intell"},{"key":"452_CR10","doi-asserted-by":"publisher","first-page":"79","DOI":"10.1007\/s10994-013-5429-5","volume":"97","author":"G Doran","year":"2014","unstructured":"Doran G, Ray S (2014) A theoretical and empirical analysis of support vector machine methods for multiple-instance classification. Mach Learn 97:79\u2013102","journal-title":"Mach Learn"},{"key":"452_CR11","doi-asserted-by":"publisher","first-page":"3336","DOI":"10.1016\/j.eswa.2008.01.039","volume":"36","author":"P Hae-Sang","year":"2009","unstructured":"Hae-Sang P, Chi-Hyuck J (2009) A simple and fast algorithm for $$k$$-medoids clustering. Expert Syst Appl 36:3336\u20133341","journal-title":"Expert Syst Appl"},{"key":"452_CR12","doi-asserted-by":"publisher","first-page":"428","DOI":"10.1016\/j.tics.2007.09.004","volume":"11","author":"GE Hinton","year":"2007","unstructured":"Hinton GE (2007) Learning multiple layers of representation. Trends Cognit Sci 11:428\u2013434","journal-title":"Trends Cognit Sci"},{"key":"452_CR13","doi-asserted-by":"crossref","unstructured":"Khoshgoftaar TM, Golawala M, van Hulse\u00a0J (2007) An empirical study of learning from imbalanced data using random forest. 
In: 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), IEEE","DOI":"10.1109\/ICTAI.2007.46"},{"key":"452_CR14","unstructured":"Kingma DP, Ba JL (2014) Adam: a method for stochastic optimization. CoRR arXiv:1412.6980v9"},{"key":"452_CR15","doi-asserted-by":"crossref","unstructured":"Liu Y, An A, Huang X (2006) Boosting prediction accuracy on imbalanced datasets with SVM ensembles. In: Advances in Knowledge Discovery and Data Mining, Springer","DOI":"10.1007\/11731139_15"},{"key":"452_CR16","first-page":"49","volume":"2","author":"PC Mahalanobis","year":"1936","unstructured":"Mahalanobis PC (1936) On the generalized distance in statistics. Proc National Instit Sci India 2:49\u201355","journal-title":"Proc National Instit Sci India"},{"key":"452_CR17","doi-asserted-by":"publisher","DOI":"10.1002\/0471725293","volume-title":"Discriminant analysis and statistical pattern recognition","author":"GJ McLachlan","year":"1992","unstructured":"McLachlan GJ (1992) Discriminant analysis and statistical pattern recognition. Wiley, New Jersey"},{"key":"452_CR18","first-page":"81","volume":"1","author":"JR Quinlan","year":"1986","unstructured":"Quinlan JR (1986) Induction of decision trees. Mach Learn 1:81\u2013106","journal-title":"Mach Learn"},{"key":"452_CR19","volume-title":"Methods of multivariate analysis","author":"AC Rencher","year":"2003","unstructured":"Rencher AC (2003) Methods of multivariate analysis. Wiley, New Jersey"},{"key":"452_CR20","doi-asserted-by":"publisher","first-page":"533","DOI":"10.1038\/323533a0","volume":"323","author":"DE Rumelhart","year":"1986","unstructured":"Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. 
Nature 323:533\u2013536","journal-title":"Nature"},{"key":"452_CR21","doi-asserted-by":"publisher","first-page":"226","DOI":"10.1016\/j.procs.2015.10.077","volume":"70","author":"H Sandhya","year":"2015","unstructured":"Sandhya H, Sandhya PV (2015) $$k$$-medoid clustering for heterogeneous datasets. Proc Computer Sci 70:226\u2013237","journal-title":"Proc Computer Sci"},{"key":"452_CR22","first-page":"197","volume":"5","author":"RE Schapire","year":"1990","unstructured":"Schapire RE (1990) The strength of weak learnability. Mach Learn 5:197\u2013227","journal-title":"Mach Learn"},{"key":"452_CR23","first-page":"1467","volume":"7","author":"L Zanni","year":"2006","unstructured":"Zanni L, Serafini T, Zanghirati G (2006) Parallel software for training large scale support vector machines on multiprocessor systems. J Mach Learn Res 7:1467\u20131492","journal-title":"J Mach Learn Res"},{"key":"452_CR24","first-page":"287","volume":"59","author":"L Zhang","year":"2004","unstructured":"Zhang L, Zhang B (2004) The quotient space theory of problem solving. Fundam Inf 59:287\u2013298","journal-title":"Fundam Inf"},{"key":"452_CR25","doi-asserted-by":"crossref","unstructured":"Zhang L, Zhang B (2006) Hierarchical machine learning \u2013 a learning methodology inspired by human intelligence. 
In: Rough Sets and Knowledge Technology, Springer","DOI":"10.1007\/11795131_3"}],"container-title":["Advances in Data Analysis and Classification"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11634-021-00452-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11634-021-00452-9\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11634-021-00452-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,8,26]],"date-time":"2022-08-26T10:29:33Z","timestamp":1661509773000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11634-021-00452-9"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7,3]]},"references-count":25,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2022,9]]}},"alternative-id":["452"],"URL":"https:\/\/doi.org\/10.1007\/s11634-021-00452-9","relation":{},"ISSN":["1862-5347","1862-5355"],"issn-type":[{"type":"print","value":"1862-5347"},{"type":"electronic","value":"1862-5355"}],"subject":[],"published":{"date-parts":[[2021,7,3]]},"assertion":[{"value":"6 July 2020","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"1 June 2021","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"8 June 2021","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"3 July 2021","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article 
History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The R, Python, and MATLAB scripts used to generate the toy data sets and perform the experiments are available on GitHub (). The industrial data used for the experiments is property of Tetra Pak Packaging Solutions S.p.A., and access must be authorised by the company.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Software and data"}},{"value":"The authors declare that they have no conflict of interest.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}