{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,17]],"date-time":"2026-03-17T04:59:09Z","timestamp":1773723549625,"version":"3.50.1"},"reference-count":59,"publisher":"MDPI AG","issue":"13","license":[{"start":{"date-parts":[[2021,6,29]],"date-time":"2021-06-29T00:00:00Z","timestamp":1624924800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Remote Sensing"],"abstract":"<jats:p>Infrared and visible images (multi-sensor or multi-band images) have many complementary features which can effectively boost the performance of object detection. Recently, convolutional neural networks (CNNs) have seen frequent use to perform object detection in multi-band images. However, it is very difficult for CNNs to extract complementary features from infrared and visible images. In order to solve this problem, a difference maximum loss function is proposed in this paper. The loss function can guide the learning directions of two base CNNs and maximize the difference between features from the two base CNNs, so as to extract complementary and diverse features. In addition, we design a focused feature-enhancement module to make features in the shallow convolutional layer more significant. In this way, the detection performance of small objects can be effectively improved while not increasing the computational cost in the testing stage. Furthermore, since the actual receptive field is usually much smaller than the theoretical receptive field, the deep convolutional layer would not have sufficient semantic features for accurate detection of large objects. To overcome this drawback, a cascaded semantic extension module is added to the deep layer. Through simple multi-branch convolutional layers and dilated convolutions with different dilation rates, the cascaded semantic extension module can effectively enlarge the actual receptive field and increase the detection accuracy of large objects. We compare our detection network with five other state-of-the-art infrared and visible image object detection networks. Qualitative and quantitative experimental results prove the superiority of the proposed detection network.<\/jats:p>","DOI":"10.3390\/rs13132538","type":"journal-article","created":{"date-parts":[[2021,6,29]],"date-time":"2021-06-29T22:39:43Z","timestamp":1625006383000},"page":"2538","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":21,"title":["Infrared and Visible Image Object Detection via Focused Feature Enhancement and Cascaded Semantic Extension"],"prefix":"10.3390","volume":"13","author":[{"given":"Xiaowu","family":"Xiao","sequence":"first","affiliation":[{"name":"School of Automation, Beijing Institute of Technology, Beijing 100081, China"}]},{"given":"Bo","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Automation, Beijing Institute of Technology, Beijing 100081, China"}]},{"given":"Lingjuan","family":"Miao","sequence":"additional","affiliation":[{"name":"School of Automation, Beijing Institute of Technology, Beijing 100081, China"}]},{"given":"Linhao","family":"Li","sequence":"additional","affiliation":[{"name":"School of Automation, Beijing Institute of Technology, Beijing 100081, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6871-8236","authenticated-orcid":false,"given":"Zhiqiang","family":"Zhou","sequence":"additional","affiliation":[{"name":"School of Automation, Beijing Institute of Technology, Beijing 100081, China"}]},{"given":"Jinlei","family":"Ma","sequence":"additional","affiliation":[{"name":"China Helicopter Research and Development Institute, Tianjin 300300, China"}]},{"given":"Dandan","family":"Dong","sequence":"additional","affiliation":[{"name":"College of Petroleum, China University of Petroleum, Karamay 834000, China"}]}],"member":"1968","published-online":{"date-parts":[[2021,6,29]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"1137","DOI":"10.1109\/TPAMI.2016.2577031","article-title":"Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks","volume":"39","author":"Ren","year":"2017","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., and Berg, A.C. (2016). SSD: Single Shot MultiBox Detector. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-319-46448-0_2"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Goyal, P., Girshick, R., He, K., and Dollar, P. (2017, January 22\u201329). Focal Loss for Dense Object Detection. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.324"},{"key":"ref_4","unstructured":"Redmon, J., and Farhadi, A. (2018). Yolov3: An incremental improvement. arXiv."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Wang, J., Chen, K., Yang, S., Loy, C.C., and Lin, D. (2019, January 16\u201320). Region Proposal by Guided Anchoring. Proceedings of the 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00308"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"820","DOI":"10.3390\/s16060820","article-title":"Pedestrian Detection at Day\/Night Time with Visible and FIR Cameras: A Comparison","volume":"16","author":"Alejandro","year":"2016","journal-title":"Sensors"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Konig, D., Adam, M., Jarvers, C., Layher, G., and Teutsch, M. (2017, January 22\u201325). Fully Convolutional Region Proposal Networks for Multispectral Person Detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Honolulu, HI, USA.","DOI":"10.1109\/CVPRW.2017.36"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"143","DOI":"10.1016\/j.patcog.2018.03.007","article-title":"Unified multi-spectral pedestrian detection based on probabilistic fusion networks","volume":"80","author":"Park","year":"2018","journal-title":"Pattern Recognit. J. Pattern Recognit. Soc."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"161","DOI":"10.1016\/j.patcog.2018.08.005","article-title":"Illumination-aware Faster R-CNN for Robust Multispectral Pedestrian Detection","volume":"85","author":"Li","year":"2019","journal-title":"Pattern Recognit."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Shopovska, I., Jovanov, L., and Philips, W. (2019). Deep Visible and Thermal Image Fusion for Enhanced Pedestrian Visibility. Sensors, 19.","DOI":"10.3390\/s19173727"},{"key":"ref_11","unstructured":"Simonyan, K., and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Jian, S. (2016, January 27\u201330). Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Huang, G., Liu, Z., Laurens, V.D.M., and Weinberger, K.Q. (2016, January 27\u201330). Densely Connected Convolutional Networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2017.243"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Hou, Y.L., Song, Y., Hao, X., Shen, Y., and Qian, M. (2018, January 21\u201324). Multispectral pedestrian detection based on deep convolutional neural networks. Proceedings of the 2017 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), Macau, China.","DOI":"10.1109\/ICSPCC.2017.8242507"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Dollar, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017, January 21\u201326). Feature Pyramid Networks for Object Detection. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.106"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Liu, S., Huang, D., and Wang, Y. (2018, January 8\u201314). Receptive Field Block Net for Accurate and Fast Object Detection. Proceedings of the European Conference on Computer Vision, Munich, Germany.","DOI":"10.1007\/978-3-030-01252-6_24"},{"key":"ref_17","unstructured":"Kaiming, H., Georgia, G., Piotr, D., and Ross, G. (2017, January 22\u201329). Mask R-CNN. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Cao, J., Cholakkal, H., Anwer, R.M., Khan, F.S., and Shao, L. (2020, January 14\u201319). D2Det: Towards High Quality Object Detection and Instance Segmentation. Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01150"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Dong, Z., Li, G., Liao, Y., Wang, F., Ren, P., and Qian, C. (2020, January 13\u201319). CentripetalNet: Pursuing High-quality Keypoint Pairs for Object Detection. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01053"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"25","DOI":"10.1006\/cviu.2001.0934","article-title":"Empirical Evaluation of Dissimilarity Measures for Color and Texture","volume":"84","author":"Rubner","year":"2001","journal-title":"Comput. Vis. Image Underst."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014, January 23\u201328). Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. Proceedings of the Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.81"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"1904","DOI":"10.1109\/TPAMI.2015.2389824","article-title":"Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition","volume":"37","author":"He","year":"2014","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Girshick, R. (2015, January 7\u201313). Fast r-cnn. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.169"},{"key":"ref_24","unstructured":"Viola, P. (2001, January 8\u201314). Rapid Object Detection using a Boosted Cascade of Simple Features. Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Kauai, HI, USA."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"137","DOI":"10.1023\/B:VISI.0000013087.49260.fb","article-title":"Robust Real-Time Face Detection","volume":"57","author":"Viola","year":"2004","journal-title":"Int. J. Comput. Vis."},{"key":"ref_26","unstructured":"Dalal, N. (2005, January 20\u201326). Histograms of oriented gradients for human detection. Proceedings of the 2005 IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Felzenszwalb, P.F., Mcallester, D.A., and Ramanan, D. (2008, January 23\u201328). A discriminatively trained, multiscale, deformable part model. Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA.","DOI":"10.1109\/CVPR.2008.4587597"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Felzenszwalb, P.F., Girshick, R.B., and Mcallester, D.A. (2010, January 13\u201318). Cascade object detection with deformable part models. Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA.","DOI":"10.1109\/CVPR.2010.5539906"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"1627","DOI":"10.1109\/TPAMI.2009.167","article-title":"Object Detection with Discriminatively Trained Part-Based Models","volume":"32","author":"Felzenszwalb","year":"2010","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_30","first-page":"442","article-title":"Object detection with grammar models","volume":"24","author":"Girshick","year":"2011","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_31","unstructured":"Girshick, R.B. (2012). From Rigid Templates to Grammars: Object Detection with Structured Models, University of Chicago."},{"key":"ref_32","unstructured":"Bochkovskiy, A., Wang, C.Y., and Liao, H. (2020). YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Zhou, P., and Geng, C. (2018, January 18\u201323). Transmission. Scale-Transferrable Object Detection. Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00062"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Singh, B., and Davis, L.S. (2018, January 18\u201323). An Analysis of Scale Invariance in Object Detection\u2014SNIP. Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00377"},{"key":"ref_35","unstructured":"Singh, B., Najibi, M., and Davis, L.S. (2018). SNIPER: Efficient Multi-Scale Training. arXiv."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Zhu, C., He, Y., and Savvides, M. (2019, January 16\u201320). Feature Selective Anchor-Free Module for Single-Shot Object Detection. Proceedings of the 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00093"},{"key":"ref_37","unstructured":"Hong, S., Roh, B., Kim, K.H., Cheon, Y., and Park, M. (2016). Pvanet: Lightweight deep neural networks for real-time object detection. arXiv."},{"key":"ref_38","first-page":"249","article-title":"Understanding the difficulty of training deep feedforward neural networks","volume":"9","author":"Glorot","year":"2010","journal-title":"J. Mach. Learn. Res."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2015, January 7\u201313). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile.","DOI":"10.1109\/ICCV.2015.123"},{"key":"ref_40","unstructured":"Luo, W., Li, Y., Urtasun, R., and Zemel, R. (2017, January 21\u201326). Understanding the Effective Receptive Field in Deep Convolutional Neural Networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA."},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"834","DOI":"10.1109\/TPAMI.2017.2699184","article-title":"DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs","volume":"40","author":"Chen","year":"2018","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_42","unstructured":"Yu, F., and Koltun, V. (2015). Multi-Scale Context Aggregation by Dilated Convolutions. arXiv."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Zhang, X., Zou, Y., and Wei, S. (2017, January 23\u201325). Dilated convolution neural network with LeakyReLU for environmental sound classification. Proceedings of the 2017 22nd International Conference on Digital Signal Processing (DSP), London, UK.","DOI":"10.1109\/ICDSP.2017.8096153"},{"key":"ref_44","unstructured":"Qiao, Z., Cui, Z., Niu, X., Geng, S., and Yu, Q. (2017). Image Segmentation with Pyramid Dilated Convolution Based on ResNet and U-Net. International Conference on Neural Information Processing, Springer."},{"key":"ref_45","unstructured":"Chen, L.C., Papandreou, G., Schroff, F., and Adam, H. (2014, January 23\u201328). Rethinking Atrous Convolution for Semantic Image Segmentation. Proceedings of the Computer Vision and Pattern Recognition, Columbus, OH, USA."},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13638-019-1445-x","article-title":"AtICNet: Semantic segmentation with atrous spatial pyramid pooling in image cascade network","volume":"2019","author":"Chen","year":"2019","journal-title":"EURASIP J. Wirel. Commun. Netw."},{"key":"ref_47","first-page":"1097","article-title":"ImageNet Classification with Deep Convolutional Neural Networks","volume":"25","author":"Krizhevsky","year":"2012","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_48","unstructured":"Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2019, March 14). Automatic Differentiation in Pytorch. Available online: https:\/\/openreview.net\/forum?id=BJJsrmfCZ."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Toet, A., Hogervorst, M.A., and Pinkus, A.R. (2016). The TRICLOBS Dynamic Multi-Band Image Data Set for the Development and Evaluation of Image Fusion Methods. PLoS ONE, 11.","DOI":"10.1371\/journal.pone.0165016"},{"key":"ref_50","unstructured":"(2019, July 25). CVC14 Dataset. Available online: http:\/\/adas.cvc.uab.es\/elektra\/enigma-portfolio\/cvc-14-visible-fir-day-night-pedestrian-sequence-dataset."},{"key":"ref_51","unstructured":"Liu, S., and Liu, Z. (2017, January 21\u201326). Multi-Channel CNN-based Object Detection for Enhanced Situation Awareness. Proceedings of the Computer Vision and Pattern Recognition, Honolulu, HI, USA."},{"key":"ref_52","doi-asserted-by":"crossref","first-page":"15","DOI":"10.1016\/j.inffus.2015.11.003","article-title":"Perceptual fusion of infrared and visible images through a hybrid multi-scale decomposition with Gaussian and bilateral filters","volume":"30","author":"Zhou","year":"2016","journal-title":"Inf. Fusion"},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Liu, J., Zhang, S., Wang, S., and Metaxas, D.N. (2016). Multispectral Deep Neural Networks for Pedestrian Detection. arXiv.","DOI":"10.5244\/C.30.73"},{"key":"ref_54","doi-asserted-by":"crossref","first-page":"2864","DOI":"10.1109\/TIP.2013.2244222","article-title":"Image Fusion With Guided Filtering","volume":"22","author":"Li","year":"2013","journal-title":"IEEE Trans. Image Process."},{"key":"ref_55","doi-asserted-by":"crossref","first-page":"94","DOI":"10.1016\/j.infrared.2013.07.010","article-title":"Image fusion based on nonsubsampled contourlet transform for infrared and visible light image","volume":"61","author":"Adu","year":"2013","journal-title":"Infrared Phys. Technol."},{"key":"ref_56","doi-asserted-by":"crossref","first-page":"147","DOI":"10.1016\/j.inffus.2014.09.004","article-title":"A general framework for image fusion based on multi-scale transform and sparse representation","volume":"24","author":"Liu","year":"2015","journal-title":"Inf. Fusion"},{"key":"ref_57","doi-asserted-by":"crossref","first-page":"11","DOI":"10.1016\/j.infrared.2015.11.003","article-title":"An adaptive fusion approach for infrared and visible images based on NSCT and compressed sensing","volume":"74","author":"Zhang","year":"2016","journal-title":"Infrared Phys. Technol."},{"key":"ref_58","doi-asserted-by":"crossref","first-page":"8","DOI":"10.1016\/j.infrared.2017.02.005","article-title":"Infrared and visible image fusion based on visual saliency map and weighted least square optimization","volume":"82","author":"Ma","year":"2017","journal-title":"Infrared Phys. Technol."},{"key":"ref_59","doi-asserted-by":"crossref","unstructured":"Aslam, A. (2020, January 8\u201311). Object Detection for Unseen Domains while Reducing Response Time using Knowledge Transfer in Multimedia Event Processing. Proceedings of the ICMR 20 International Conference on Multimedia Retrieval, Dublin, Ireland.","DOI":"10.1145\/3372278.3391936"}],"container-title":["Remote Sensing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2072-4292\/13\/13\/2538\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T06:26:57Z","timestamp":1760164017000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2072-4292\/13\/13\/2538"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,6,29]]},"references-count":59,"journal-issue":{"issue":"13","published-online":{"date-parts":[[2021,7]]}},"alternative-id":["rs13132538"],"URL":"https:\/\/doi.org\/10.3390\/rs13132538","relation":{},"ISSN":["2072-4292"],"issn-type":[{"value":"2072-4292","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,6,29]]}}}