{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,15]],"date-time":"2026-01-15T08:34:42Z","timestamp":1768466082491,"version":"3.49.0"},"reference-count":54,"publisher":"Springer Science and Business Media LLC","issue":"4","license":[{"start":{"date-parts":[[2024,4,15]],"date-time":"2024-04-15T00:00:00Z","timestamp":1713139200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,4,15]],"date-time":"2024-04-15T00:00:00Z","timestamp":1713139200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62176239"],"award-info":[{"award-number":["62176239"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Complex Intell. Syst."],"published-print":{"date-parts":[[2024,8]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>The incorporation of thermal imaging data in RGB-T images has demonstrated its usefulness in cross-modal crowd counting by offering complementary information to RGB representations. Despite achieving satisfactory results in RGB-T crowd counting, many existing methods still face two significant limitations: (1) The oversight of the heterogeneous gap between modalities complicates the effective integration of multimodal features. (2) The absence of mining consistency hinders the full exploitation of the unique complementary strengths inherent in each modality. To this end, we present C4-MIM, a novel Consistency-constrained RGB-T Crowd Counting approach via Mutual Information Maximization. 
It effectively leverages multimodal information by learning the consistency between the RGB and thermal modalities, thereby enhancing the performance of cross-modal counting. Specifically, we first advocate extracting feature representations of different modalities in a shared encoder to moderate the heterogeneous gap since they obey identical coding rules with shared parameters. Then, we intend to mine the consistent information of different modalities to better learn conducive information and improve the performance of feature representations. To this end, we formulate the complementarity of multimodality representations as a mutual information maximization regularizer to maximize the consistent information of different modalities, in which the consistency would be maximally attained before combining the multimodal information. Finally, we simply aggregate the feature representations of the different modalities and send them into a regressor to output the density maps. The proposed approach can be implemented with arbitrary backbone networks and is quite robust when a single modality is unavailable or seriously compromised. 
Extensive experiments have been conducted on the RGBT-CC and DroneRGBT benchmarks to evaluate the effectiveness and robustness of the proposed approach, demonstrating its superior performance compared to the SOTA approaches.<\/jats:p>","DOI":"10.1007\/s40747-024-01427-x","type":"journal-article","created":{"date-parts":[[2024,4,15]],"date-time":"2024-04-15T09:01:59Z","timestamp":1713171719000},"page":"5049-5070","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":6,"title":["Consistency-constrained RGB-T crowd counting via mutual information maximization"],"prefix":"10.1007","volume":"10","author":[{"given":"Qiang","family":"Guo","sequence":"first","affiliation":[]},{"given":"Pengcheng","family":"Yuan","sequence":"additional","affiliation":[]},{"given":"Xiangming","family":"Huang","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7027-8313","authenticated-orcid":false,"given":"Yangdong","family":"Ye","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,4,15]]},"reference":[{"key":"1427_CR1","unstructured":"Belghazi MI, Baratin A, Rajeswar S, et\u00a0al (2018) Mutual information neural estimation. In: International conference on machine learning, pp 530\u2013539"},{"key":"1427_CR2","doi-asserted-by":"crossref","unstructured":"Chan AB, Vasconcelos N (2009) Bayesian poisson regression for crowd counting. In: 2009 IEEE 12th international conference on computer vision, pp 545\u2013551","DOI":"10.1109\/ICCV.2009.5459191"},{"key":"1427_CR3","doi-asserted-by":"crossref","unstructured":"Cheng Z, Li J, Dai Q, et\u00a0al (2019) Improving the learning of multi-column convolutional neural network for crowd counting. 
In: Proceedings of the 27th ACM International Conference on Multimedia, pp 1897\u20131906","DOI":"10.1145\/3343031.3350898"},{"key":"1427_CR4","unstructured":"Faivishevsky L, Goldberger J (2008) ICA based on a smooth estimation of the differential entropy. In: Advances in neural information processing systems, pp 433\u2013440"},{"key":"1427_CR5","doi-asserted-by":"crossref","unstructured":"Fan D, Zhai Y, Borji A, et\u00a0al (2020) Bbs-net: RGB-D salient object detection with a bifurcated backbone strategy network. In: Proceedings of the European Conference on Computer Vision, pp 275\u2013292","DOI":"10.1007\/978-3-030-58610-2_17"},{"key":"1427_CR6","unstructured":"Gao G, Gao J, Liu Q, et\u00a0al (2020a) Cnn-based density estimation and crowd counting: A survey. CoRR abs\/2003.12783"},{"key":"1427_CR7","doi-asserted-by":"crossref","unstructured":"Gao J, Hua Y, Hu G, et\u00a0al (2020b) Reducing distributional uncertainty by mutual information maximisation and transferable feature learning. In: Proceedings of the European Conference on Computer Vision, pp 587\u2013605","DOI":"10.1007\/978-3-030-58592-1_35"},{"issue":"1","key":"1427_CR8","doi-asserted-by":"publisher","first-page":"317","DOI":"10.1007\/s40747-022-00792-9","volume":"9","author":"P Guo","year":"2023","unstructured":"Guo P, Xie G, Li R et al (2023) Multimodal medical image fusion with convolution sparse representation and mutual information correlation in nsst domain. Complex Intell Syst 9(1):317\u2013328","journal-title":"Complex Intell Syst"},{"key":"1427_CR9","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2020.106691","volume":"213","author":"Q Guo","year":"2021","unstructured":"Guo Q, Zeng X, Hu S et al (2021) Learning a deep network with cross-hierarchy aggregation for crowd counting. 
Knowl Based Syst 213:106691","journal-title":"Knowl Based Syst"},{"key":"1427_CR10","unstructured":"Hjelm RD, Fedorov A, Lavoie-Marchildon S, et\u00a0al (2019) Learning deep representations by mutual information estimation and maximization. In: Proceedings of the International Conference on Learning Representations"},{"key":"1427_CR11","first-page":"2547","volume":"2013","author":"H Idrees","year":"2013","unstructured":"Idrees H, Saleemi I, Seibert C et al (2013) Multi-source multi-scale counting in extremely dense crowd images. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition 2013:2547\u20132554","journal-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition"},{"issue":"5","key":"1427_CR12","doi-asserted-by":"publisher","first-page":"1408","DOI":"10.1109\/TCSVT.2018.2837153","volume":"29","author":"D Kang","year":"2019","unstructured":"Kang D, Ma Z, Chan AB (2019) Beyond counting: Comparisons of density maps for crowd analysis tasks - counting, detection, and tracking. IEEE Trans Circuits Syst Video Technol 29(5):1408\u20131422","journal-title":"IEEE Trans Circuits Syst Video Technol"},{"key":"1427_CR13","doi-asserted-by":"crossref","unstructured":"Kemertas M, Pishdad L, Derpanis KG, et\u00a0al (2020) Rankmi: A mutual information maximizing ranking loss. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 14350\u201314359","DOI":"10.1109\/CVPR42600.2020.01437"},{"key":"1427_CR14","unstructured":"Kingma DP, Ba J (2015) Adam: A method for stochastic optimization. In: Proceedings of the International Conference on Learning Representations"},{"key":"1427_CR15","doi-asserted-by":"crossref","unstructured":"Li F, Zhou Y, Chen Y, et\u00a0al (2023a) Multi-scale attention-based lightweight network with dilated convolutions for infrared and visible image fusion. 
Complex Intell Syst pp 1\u201315","DOI":"10.1007\/s40747-023-01185-2"},{"key":"1427_CR16","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2022.109944","volume":"257","author":"H Li","year":"2022","unstructured":"Li H, Zhang S, Kong W (2022) Learning the cross-modal discriminative feature representation for RGB-T crowd counting. Knowl Based Syst 257:109944","journal-title":"Knowl Based Syst"},{"key":"1427_CR17","doi-asserted-by":"crossref","unstructured":"Li H, Zhang J, Kong W, et\u00a0al (2023b) Csa-net: Cross-modal scale-aware attention-aggregated network for RGB-T crowd counting. Expert Syst Appl 213(Part):119038","DOI":"10.1016\/j.eswa.2022.119038"},{"key":"1427_CR18","doi-asserted-by":"crossref","unstructured":"Li Y, Zhang X, Chen D (2018) Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 1091\u20131100","DOI":"10.1109\/CVPR.2018.00120"},{"key":"1427_CR19","doi-asserted-by":"publisher","first-page":"2461","DOI":"10.1109\/TMM.2021.3081930","volume":"24","author":"Z Li","year":"2022","unstructured":"Li Z, Tang C, Liu X et al (2022) Consensus graph learning for multi-view clustering. IEEE Trans Multim 24:2461\u20132472","journal-title":"IEEE Trans Multim"},{"key":"1427_CR20","doi-asserted-by":"crossref","unstructured":"Lian D, Li J, Zheng J, et\u00a0al (2019) Density map regression guided detection network for RGB-D crowd counting and localization. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 1821\u20131830","DOI":"10.1109\/CVPR.2019.00192"},{"key":"1427_CR21","doi-asserted-by":"crossref","unstructured":"Lin H, Ma Z, Ji R, et\u00a0al (2022) Boosting crowd counting via multifaceted attention. 
In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 19628\u201319637","DOI":"10.1109\/CVPR52688.2022.01901"},{"key":"1427_CR22","doi-asserted-by":"crossref","unstructured":"Liu L, Qiu Z, Li G, et\u00a0al (2019) Crowd counting with deep structured scale integration network. In: Proceedings of the IEEE\/CVF international conference on computer vision, pp 1774\u20131783","DOI":"10.1109\/ICCV.2019.00186"},{"key":"1427_CR23","doi-asserted-by":"crossref","unstructured":"Liu L, Chen J, Wu H, et\u00a0al (2020a) Efficient crowd counting via structured knowledge transfer. In: Proceedings of the 28th ACM international conference on multimedia, pp 2645\u20132654","DOI":"10.1145\/3394171.3413938"},{"key":"1427_CR24","doi-asserted-by":"crossref","unstructured":"Liu L, Lu H, Zou H, et\u00a0al (2020b) Weighing counts: Sequential crowd counting by reinforcement learning. In: Proceedings of the European Conference on Computer Vision, pp 164\u2013181","DOI":"10.1007\/978-3-030-58607-2_10"},{"key":"1427_CR25","doi-asserted-by":"crossref","unstructured":"Liu L, Chen J, Wu H, et\u00a0al (2021a) Cross-modal collaborative representation learning and a large-scale RGBT benchmark for crowd counting. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 4823\u20134833","DOI":"10.1109\/CVPR46437.2021.00479"},{"issue":"11","key":"1427_CR26","doi-asserted-by":"publisher","first-page":"7169","DOI":"10.1109\/TITS.2020.3002718","volume":"22","author":"L Liu","year":"2021","unstructured":"Liu L, Zhen J, Li G et al (2021) Dynamic spatial-temporal representation learning for traffic flow prediction. IEEE Trans Intell Transp Syst 22(11):7169\u20137183","journal-title":"IEEE Trans Intell Transp Syst"},{"key":"1427_CR27","doi-asserted-by":"crossref","unstructured":"Liu W, Salzmann M, Fua P (2020c) Estimating people flows to better count them in crowded scenes. 
In: Proceedings of the European Conference on Computer Vision, pp 723\u2013740","DOI":"10.1007\/978-3-030-58555-6_43"},{"key":"1427_CR28","doi-asserted-by":"crossref","unstructured":"Liu Z, Feng R, Chen H, et\u00a0al (2022) Temporal feature alignment and mutual information maximization for video-based human pose estimation. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 10996\u201311006","DOI":"10.1109\/CVPR52688.2022.01073"},{"key":"1427_CR29","doi-asserted-by":"crossref","unstructured":"Ma Z, Wei X, Hong X, et\u00a0al (2019) Bayesian loss for crowd count estimation with point supervision. In: Proceedings of the IEEE\/CVF international conference on computer vision, pp 6141\u20136150","DOI":"10.1109\/ICCV.2019.00624"},{"key":"1427_CR30","doi-asserted-by":"crossref","unstructured":"Ma Z, Wei X, Hong X, et\u00a0al (2021) Learning to count via unbalanced optimal transport. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 2319\u20132327","DOI":"10.1609\/aaai.v35i3.16332"},{"key":"1427_CR31","doi-asserted-by":"crossref","unstructured":"Mao Y, Yan X, Guo Q, et\u00a0al (2021) Deep mutual information maximin for cross-modal clustering. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 8893\u20138901","DOI":"10.1609\/aaai.v35i10.17076"},{"key":"1427_CR32","doi-asserted-by":"publisher","DOI":"10.1016\/j.engappai.2023.106885","volume":"126","author":"Y Pan","year":"2023","unstructured":"Pan Y, Zhou W, Qian X et al (2023) Cginet: Cross-modality grade interaction network for rgb-t crowd counting. Eng Appl Artif Intell 126:106885","journal-title":"Eng Appl Artif Intell"},{"key":"1427_CR33","doi-asserted-by":"crossref","unstructured":"Pang Y, Zhang L, Zhao X, et\u00a0al (2020) Hierarchical dynamic filtering network for RGB-D salient object detection. 
In: Proceedings of the European Conference on Computer Vision, pp 235\u2013252","DOI":"10.1007\/978-3-030-58595-2_15"},{"key":"1427_CR34","unstructured":"Paszke A, Gross S, Massa F, et\u00a0al (2019) Pytorch: An imperative style, high-performance deep learning library. In: Advances in neural information processing systems, pp 8024\u20138035"},{"key":"1427_CR35","doi-asserted-by":"crossref","unstructured":"Peng T, Li Q, Zhu P (2020) RGB-T crowd counting from drone: A benchmark and MMCCN network. In: Proceedings of the Asian conference on computer vision, pp 497\u2013513","DOI":"10.1007\/978-3-030-69544-6_30"},{"key":"1427_CR36","doi-asserted-by":"crossref","unstructured":"Shu W, Wan J, Tan KC, et\u00a0al (2022) Crowd counting in the frequency domain. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 19618\u201319627","DOI":"10.1109\/CVPR52688.2022.01900"},{"key":"1427_CR37","doi-asserted-by":"crossref","unstructured":"Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: Proceedings of the International Conference on Learning Representations","DOI":"10.1109\/ICCV.2015.314"},{"key":"1427_CR38","doi-asserted-by":"crossref","unstructured":"Viola PA, Jones MJ, Snow D (2003) Detecting pedestrians using patterns of motion and appearance. In: Proceedings of the IEEE\/CVF international conference on computer vision, pp 734\u2013741","DOI":"10.1109\/ICCV.2003.1238422"},{"key":"1427_CR39","unstructured":"Wang B, Liu H, Samaras D, et\u00a0al (2020) Distribution matching for crowd counting. In: Advances in neural information processing systems, pp 1595\u20131607"},{"key":"1427_CR40","doi-asserted-by":"publisher","first-page":"306","DOI":"10.1016\/j.ins.2022.01.046","volume":"591","author":"F Wang","year":"2022","unstructured":"Wang F, Sang J, Wu Z et al (2022) Hybrid attention network based on progressive embedding scale-context for crowd counting. 
Inf Sci 591:306\u2013318","journal-title":"Inf Sci"},{"key":"1427_CR41","doi-asserted-by":"crossref","unstructured":"Wu Z, Liu L, Zhang Y, et\u00a0al (2022) Multimodal crowd counting with mutual attention transformers. In: 2022 IEEE International Conference on Multimedia and Expo, pp 1\u20136","DOI":"10.1109\/ICME52920.2022.9859777"},{"key":"1427_CR42","doi-asserted-by":"crossref","unstructured":"Yu G, Cai R, Luo Y, et\u00a0al (2023) A-pruning: a lightweight pineapple flower counting network based on filter pruning. Complex Intell Syst pp 1\u201320","DOI":"10.2139\/ssrn.4196753"},{"key":"1427_CR43","doi-asserted-by":"crossref","unstructured":"Zeng X, Wu Y, Hu S, et\u00a0al (2020) Dspnet: Deep scale purifier network for dense crowd counting. Expert Syst Appl 141","DOI":"10.1016\/j.eswa.2019.112977"},{"key":"1427_CR44","doi-asserted-by":"crossref","unstructured":"Zhang B, Du Y, Zhao Y, et\u00a0al (2021a) I-MMCCN: improved MMCCN for RGB-T crowd counting of drone images. In: 2021 7th IEEE International Conference on Network Intelligence and Digital Content, pp 117\u2013121","DOI":"10.1109\/IC-NIDC54101.2021.9660586"},{"key":"1427_CR45","doi-asserted-by":"crossref","unstructured":"Zhang J, Fan D, Dai Y, et\u00a0al (2020) Uc-net: Uncertainty inspired RGB-D saliency detection via conditional variational autoencoders. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 8579\u20138588","DOI":"10.1109\/CVPR42600.2020.00861"},{"key":"1427_CR46","doi-asserted-by":"crossref","unstructured":"Zhang Q, Chan AB (2019) Wide-area crowd counting via ground-plane density maps and multi-view fusion cnns. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 8297\u20138306","DOI":"10.1109\/CVPR.2019.00849"},{"key":"1427_CR47","doi-asserted-by":"crossref","unstructured":"Zhang Q, Lin W, Chan AB (2021b) Cross-view cross-scene multi-view crowd counting. 
In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 557\u2013567","DOI":"10.1109\/CVPR46437.2021.00062"},{"key":"1427_CR48","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2021.115071","volume":"180","author":"S Zhang","year":"2021","unstructured":"Zhang S, Li H, Kong W (2021) A cross-modal fusion based approach with scale-aware deep representation for RGB-D crowd counting and density estimation. Expert Syst Appl 180:115071","journal-title":"Expert Syst Appl"},{"key":"1427_CR49","unstructured":"Zhang S, Yang L, Mi MB, et\u00a0al (2023a) Improving deep regression with ordinal entropy. In: Proceedings of the International Conference on Learning Representations"},{"key":"1427_CR50","doi-asserted-by":"crossref","unstructured":"Zhang Y, Zhang Z, Zhang P, et\u00a0al (2023b) Salient object detection for rgbd video via spatial interaction and depth-based boundary refinement. Complex Intell Syst pp 1\u201316","DOI":"10.1007\/s40747-023-01072-w"},{"issue":"7","key":"1427_CR51","doi-asserted-by":"publisher","first-page":"1198","DOI":"10.1109\/TPAMI.2007.70770","volume":"30","author":"T Zhao","year":"2008","unstructured":"Zhao T, Nevatia R, Wu B (2008) Segmentation and tracking of multiple humans in crowded environments. IEEE Trans Pattern Anal Mach Intell 30(7):1198\u20131211","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"1427_CR52","doi-asserted-by":"crossref","unstructured":"Zhou M, Yan K, Huang J, et\u00a0al (2022a) Mutual information-driven pan-sharpening. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 1788\u20131798","DOI":"10.1109\/CVPR52688.2022.00184"},{"issue":"12","key":"1427_CR53","doi-asserted-by":"publisher","first-page":"24540","DOI":"10.1109\/TITS.2022.3203385","volume":"23","author":"W Zhou","year":"2022","unstructured":"Zhou W, Pan Y, Lei J et al (2022) Defnet: Dual-branch enhanced feature fusion network for RGB-T crowd counting. 
IEEE Trans Intell Transp Syst 23(12):24540\u201324549","journal-title":"IEEE Trans Intell Transp Syst"},{"key":"1427_CR54","doi-asserted-by":"crossref","unstructured":"Zhou W, Yang X, Lei J, et\u00a0al (2023) MC$$^3$$Net: Multimodality cross-guided compensation coordination network for rgb-t crowd counting. IEEE Trans Intell Transp Syst","DOI":"10.1109\/TITS.2023.3321328"}],"container-title":["Complex &amp; Intelligent Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-024-01427-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s40747-024-01427-x\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-024-01427-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,11,16]],"date-time":"2024-11-16T07:48:27Z","timestamp":1731743307000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s40747-024-01427-x"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,4,15]]},"references-count":54,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,8]]}},"alternative-id":["1427"],"URL":"https:\/\/doi.org\/10.1007\/s40747-024-01427-x","relation":{},"ISSN":["2199-4536","2198-6053"],"issn-type":[{"value":"2199-4536","type":"print"},{"value":"2198-6053","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,4,15]]},"assertion":[{"value":"31 October 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"9 March 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"15 April 
2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}