{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,1]],"date-time":"2026-05-01T14:43:06Z","timestamp":1777646586964,"version":"3.51.4"},"reference-count":18,"publisher":"Springer Science and Business Media LLC","issue":"29","license":[{"start":{"date-parts":[[2025,9,2]],"date-time":"2025-09-02T00:00:00Z","timestamp":1756771200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,9,2]],"date-time":"2025-09-02T00:00:00Z","timestamp":1756771200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100008814","name":"Universidade do Minho","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100008814","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Neural Comput &amp; Applic"],"published-print":{"date-parts":[[2025,10]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Object detection is one of the most fundamental problems to tackle in the computer vision research area. Recent advances in multimodal data streams and deep learning architectures have prompted a fast growth in the field of multimodal learning, which brings several advantages over single-modality approaches for object detection, such as improved accuracy, robustness to noise and ambiguity, handling of complex scenarios and adaptability to diverse data. Some of the biggest challenges when implementing a multimodal learning approach are the selection of the fusion strategy, design of processing architecture, modality alignment\/synchronization and interpretability of such high-dimensional representations. To address this challenge, we propose a feature-level fusion architecture for object detection based on extracting YOLO features from images, spectral and rhythm features from sound using Mel-frequency cepstral coefficients, and general descriptors from radar modalities that, after timestamp and homography transformation matrix alignment, are combined with an attention mechanism into a single classification network. Preliminary experiments indicate that the proposed architecture can constitute itself as a base pipeline for several different multimodal object detection tasks in real-world applications.<\/jats:p>","DOI":"10.1007\/s00521-025-11521-x","type":"journal-article","created":{"date-parts":[[2025,9,2]],"date-time":"2025-09-02T04:37:46Z","timestamp":1756787866000},"page":"23799-23810","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Multimodal object detection: an architecture using feature-level fusion and deep learning"],"prefix":"10.1007","volume":"37","author":[{"given":"Rui","family":"Silva","sequence":"first","affiliation":[]},{"given":"Eduardo","family":"Coelho","sequence":"additional","affiliation":[]},{"given":"Nuno","family":"Pimenta","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8313-7023","authenticated-orcid":false,"given":"Dalila","family":"Dur\u00e3es","sequence":"additional","affiliation":[]},{"given":"Victor","family":"Alves","sequence":"additional","affiliation":[]},{"given":"Louren\u00e7o","family":"Bandeira","sequence":"additional","affiliation":[]},{"given":"Jos\u00e9","family":"Machado","sequence":"additional","affiliation":[]},{"given":"Paulo","family":"Novais","sequence":"additional","affiliation":[]},{"given":"Pedro","family":"Melo-Pinto","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,9,2]]},"reference":[{"key":"11521_CR1","doi-asserted-by":"publisher","first-page":"91","DOI":"10.1023\/B:VISI.0000029664.99615.94","volume":"60","author":"DG Lowe","year":"2004","unstructured":"Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vision 60:91\u2013110","journal-title":"Int J Comput Vision"},{"key":"11521_CR2","doi-asserted-by":"crossref","unstructured":"Viola P, Jones M (2001) Rapid object detection using a boosted cascade of simple features. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, p. 511. CVPR, ???","DOI":"10.1109\/CVPR.2001.990517"},{"key":"11521_CR3","doi-asserted-by":"crossref","unstructured":"Girshick R, Donahue J, Darrel T, Malik, J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580\u2013587. CVPR, ???","DOI":"10.1109\/CVPR.2014.81"},{"issue":"2","key":"11521_CR4","doi-asserted-by":"publisher","first-page":"303","DOI":"10.1007\/s11263-009-0275-4","volume":"88","author":"M Everingham","year":"2010","unstructured":"Everingham M, Van Gool L, Williams CK, Winn J, Zisserman A (2010) The pascal visual object classes (voc) challenge. Int J Comput Vision 88(2):303\u2013338","journal-title":"Int J Comput Vision"},{"key":"11521_CR5","doi-asserted-by":"crossref","unstructured":"Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanana D, ..., Zitnick CL (2014) Microsfot coco: Common objects in context. In: Proceedings of the 13th European Conference in Computer Vision, pp. 740\u2013755. ECCV, ???","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"11521_CR6","doi-asserted-by":"crossref","unstructured":"Shrivastava A, Gupta A, Girshick R (2016) Training region-based object detectors with online hard example mining. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 761\u2013769. CVPR, ???","DOI":"10.1109\/CVPR.2016.89"},{"issue":"2","key":"11521_CR7","doi-asserted-by":"publisher","first-page":"423","DOI":"10.1109\/TPAMI.2018.2798607","volume":"41","author":"T Baltru\u0161aitis","year":"2019","unstructured":"Baltru\u0161aitis T, Ahuja C, Morency LP (2019) Multimodal machine learning: a survey and taxonomy. IEEE Trans Pattern Anal Mach Intell 41(2):423\u2013443","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"11521_CR8","unstructured":"Zhu F, Lapata M (2020) Incorporating textual context into multimodal language understanding. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pp. 4074\u20134085. EMNLP, ???"},{"key":"11521_CR9","doi-asserted-by":"publisher","first-page":"866","DOI":"10.3390\/s19040866","volume":"19","author":"T Ophoff","year":"2019","unstructured":"Ophoff T, Van Beeck K, Goedem\u00e9 T (2019) Exploring rgb+depth fusion for real-time object detection. Sensors 19:866","journal-title":"Sensors"},{"key":"11521_CR10","doi-asserted-by":"publisher","first-page":"364","DOI":"10.1016\/j.neucom.2019.10.025","volume":"378","author":"Q Luo","year":"2020","unstructured":"Luo Q, Ma H, Tang L, Wang Y, Xiong R (2020) 3d-ssd: learning hierarchical features from rgb-d images for amodal 3d object detection. Neurocomputing 378:364\u2013374","journal-title":"Neurocomputing"},{"key":"11521_CR11","doi-asserted-by":"crossref","unstructured":"Song S, Xiao J (2016) Deep sliding shapes for amodal 3d object detection in rgb-d images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 808\u2013816. CVPR, ???","DOI":"10.1109\/CVPR.2016.94"},{"key":"11521_CR12","doi-asserted-by":"publisher","first-page":"149","DOI":"10.1007\/s11042-022-12964-3","volume":"82","author":"Q Dong","year":"2023","unstructured":"Dong Q, Liu Y, Liu X (2023) Drone sound detection system based on feature result-level fusion using deep learning. Multimed Tools Appl 82:149\u2013171","journal-title":"Multimed Tools Appl"},{"key":"11521_CR13","doi-asserted-by":"crossref","unstructured":"Xie Y, Zhang L, Yu X, Xie W (2023) Yolo-ms: Multispectral object detection via feature interaction and self-attention guided fusion. IEEE Transactions on Cognitive and Developmental Systems","DOI":"10.1109\/TCDS.2023.3238181"},{"key":"11521_CR14","doi-asserted-by":"crossref","unstructured":"Coelho E, Pimenta N, Peixoto H, Dur\u00e3es D, Melo-Pinto P, Alves V, Bandeira L, Machado J, Novais P (2023) Multi-agent system for multimodal machine learning object detection. Manuscript Submitted for Publication","DOI":"10.1007\/978-3-031-40725-3_57"},{"key":"11521_CR15","doi-asserted-by":"crossref","unstructured":"Wang CY, Bochkovskiy A, Liao (2023) HYM Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 7464\u20137475. IEEE\/CVF, ???","DOI":"10.1109\/CVPR52729.2023.00721"},{"key":"11521_CR16","doi-asserted-by":"crossref","unstructured":"Silva R, Freitas OG, Melo-Pinto P (2023) Boosting the performance of sota convolution-based networks with dimensionality reduction: An application on hyperspectral images of wine grape berries. Intelligent Systems with Applications 19","DOI":"10.1016\/j.iswa.2023.200252"},{"issue":"9","key":"11521_CR17","doi-asserted-by":"publisher","first-page":"2812","DOI":"10.1039\/C3AY41907J","volume":"6","author":"R Bro","year":"2014","unstructured":"Bro R, Smilde AK (2014) Principal component analysis. Anal Methods 6(9):2812\u20132831","journal-title":"Anal Methods"},{"key":"11521_CR18","doi-asserted-by":"crossref","unstructured":"Molau S, Pitz M, Schluter R,Ney H (2001) Computing mel-frequency cepstral coefficients on the power spectrum. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 73\u201376. IEEE, ???","DOI":"10.1109\/ICASSP.2001.940770"}],"container-title":["Neural Computing and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00521-025-11521-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s00521-025-11521-x\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00521-025-11521-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,30]],"date-time":"2025-09-30T05:22:48Z","timestamp":1759209768000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s00521-025-11521-x"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,2]]},"references-count":18,"journal-issue":{"issue":"29","published-print":{"date-parts":[[2025,10]]}},"alternative-id":["11521"],"URL":"https:\/\/doi.org\/10.1007\/s00521-025-11521-x","relation":{},"ISSN":["0941-0643","1433-3058"],"issn-type":[{"value":"0941-0643","type":"print"},{"value":"1433-3058","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,9,2]]},"assertion":[{"value":"12 January 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"9 July 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"2 September 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}