{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,11]],"date-time":"2025-12-11T21:03:39Z","timestamp":1765487019704,"version":"build-2065373602"},"reference-count":26,"publisher":"MDPI AG","issue":"2","license":[{"start":{"date-parts":[[2024,1,16]],"date-time":"2024-01-16T00:00:00Z","timestamp":1705363200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Seoul National University of Science and Technology"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Remote Sensing"],"abstract":"<jats:p>In object detection, Transformer-based models such as DETR have exhibited state-of-the-art performance, capitalizing on the attention mechanism to handle spatial relations and feature dependencies. One inherent challenge these models face is the intertwined handling of content and positional data within their attention spans, potentially blurring the specificity of the information retrieval process. We consider object detection as a comprehensive task, and simultaneously merging content and positional information like before can exacerbate task complexity. This paper presents the Multi-Task Fusion Detector (MTFD), a novel architecture that innovatively dissects the detection process into distinct tasks, addressing content and position through separate decoders. By utilizing assumed fake queries, the MTFD framework enables each decoder to operate under a presumption of known ancillary information, ensuring more specific and enriched interactions with the feature map. 
Experimental results affirm that this methodical separation followed by a deliberate fusion not only simplifies the task difficulty of the detection process but also augments accuracy and clarifies the details of each component, providing a fresh perspective on object detection in Transformer-based architectures.<\/jats:p>","DOI":"10.3390\/rs16020353","type":"journal-article","created":{"date-parts":[[2024,1,16]],"date-time":"2024-01-16T04:03:30Z","timestamp":1705377810000},"page":"353","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["Improved Object Detection with Content and Position Separation in Transformer"],"prefix":"10.3390","volume":"16","author":[{"given":"Yao","family":"Wang","sequence":"first","affiliation":[{"name":"Graduate School of Automotive Engineering, Seoul National University of Science and Technology, Seoul 01811, Republic of Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4144-1000","authenticated-orcid":false,"given":"Jong-Eun","family":"Ha","sequence":"additional","affiliation":[{"name":"Department of Mechanical and Automotive Engineering, Seoul National University of Science and Technology, Seoul 01811, Republic of Korea"}]}],"member":"1968","published-online":{"date-parts":[[2024,1,16]]},"reference":[{"key":"ref_1","unstructured":"Redmon, J., and Farhadi, A. (2018). YOLOv3: An Incremental Improvement. arXiv."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"1137","DOI":"10.1109\/TPAMI.2016.2577031","article-title":"Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks","volume":"39","author":"Ren","year":"2017","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Tian, Z., Shen, C., Chen, H., and He, T. (2019, October 27\u2013November 2). FCOS: Fully Convolutional One-Stage Object Detection. 
Proceedings of the 2019 IEEE\/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea.","DOI":"10.1109\/ICCV.2019.00972"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"318","DOI":"10.1109\/TPAMI.2018.2858826","article-title":"Focal Loss for Dense Object Detection","volume":"42","author":"Lin","year":"2020","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_5","unstructured":"Ge, Z., Liu, S., Wang, F., Li, Z., and Sun, J. (2021). YOLOX: Exceeding YOLO Series in 2021. arXiv."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"He, K., Gkioxari, G., Dollar, P., and Girshick, R. (2017, January 22\u201329). Mask R-CNN. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.322"},{"key":"ref_7","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017, January 4\u20139). Attention is All You Need. Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017): Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. (2020, January 23\u201328). End-to-End Object Detection with Transformers. Proceedings of the European Conference on Computer Vision, Glasgow, UK.","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"ref_9","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv."},{"key":"ref_10","unstructured":"Liu, S., Li, F., Zhang, H., Yang, X., Qi, X., Su, H., Zhu, J., and Zhang, L. (2022). DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR. 
arXiv."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Meng, D., Chen, X., Fan, Z., Zeng, G., Li, H., Yuan, Y., Sun, L., and Wang, J. (2021, January 10\u201317). Conditional DETR for Fast Training Convergence. Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00363"},{"key":"ref_12","unstructured":"Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L.M., and Shum, H.-Y. (2022). DINO: DETR with Improved Denoising Anchor Boxes for End-to-End Object Detection. arXiv."},{"key":"ref_13","unstructured":"Zhu, X., Su, W., Lu, L., Li, B., Wang, X., and Dai, J. (2021, January 3\u20137). Deformable DETR: Deformable Transformers for End-to-End Object Detection. Proceedings of the Ninth International Conference on Learning Representations, Virtual Event, Austria."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Li, F., Zhang, H., Liu, S., Guo, J., Ni, L.M., and Zhang, L. (2022). DN-DETR: Accelerate DETR Training by Introducing Query Denoising. arXiv.","DOI":"10.1109\/CVPR52688.2022.01325"},{"key":"ref_15","unstructured":"(2023, October 30). Papers with Code\u2014Coco Test-Dev Benchmark in Object Detection. Available online: https:\/\/paperswithcode.com\/sota\/object-detection-on-coco."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Cai, Z., and Vasconcelos, N. (2018, January 18\u201323). Cascade R-CNN: Delving into High Quality Object Detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00644"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Dai, X., Chen, Y., Yang, J., Zhang, P., Yuan, L., and Zhang, L. (2021, January 11\u201317). Dynamic DETR: End-to-End Object Detection with Dynamic Attention. 
Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00298"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Sun, P., Zhang, R., Jiang, Y., Kong, T., Xu, C., Zhan, W., Tomizuka, M., Li, L., Yuan, Z., and Wang, C. (2021, January 20\u201325). Sparse R-CNN: End-to-End Object Detection with Learnable Proposals. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.01422"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., and Torralba, A. (2017, January 21\u201326). Scene Parsing Through ADE20K Dataset. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.544"},{"key":"ref_20","unstructured":"Zhou, X., Wang, D., and Kr\u00e4henb\u00fchl, P. (2019). Objects as Points. arXiv."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Sirisha, M., and Sudha, S.V. (2023, January 9\u201310). A Review of Deep Learning-based Object Detection Current and Future Perspectives. Proceedings of the Third International Conference on Sustainable Expert Systems, Nepal.","DOI":"10.1007\/978-981-19-7874-6_69"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"107021","DOI":"10.1016\/j.engappai.2023.107021","article-title":"Transformer for Object Detection: Review and Benchmark","volume":"126","author":"Li","year":"2023","journal-title":"Eng. Appl. Artif. Intell."},{"key":"ref_23","unstructured":"Rekavandi, A.M., Rashidi, S., Boussaid, F., Hoefs, S., Akbas, E., and Bennamoun, M. (2023). Transformers in Small Object Detection: A Benchmark and Survey of State-of-The-Art. arXiv."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep Residual Learning for Image Recognition. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_25","unstructured":"(2023, October 30). Available online: https:\/\/colab.research.google.com\/github\/facebookresearch\/detr\/blob\/colab\/notebooks\/detr_attention.ipynb."},{"key":"ref_26","unstructured":"(2023, October 30). Available online: https:\/\/github.com\/IDEA-Research\/DAB-DETR\/blob\/main\/inference_and_visualize.ipynb."}],"container-title":["Remote Sensing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2072-4292\/16\/2\/353\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T13:47:46Z","timestamp":1760104066000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2072-4292\/16\/2\/353"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,1,16]]},"references-count":26,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2024,1]]}},"alternative-id":["rs16020353"],"URL":"https:\/\/doi.org\/10.3390\/rs16020353","relation":{},"ISSN":["2072-4292"],"issn-type":[{"type":"electronic","value":"2072-4292"}],"subject":[],"published":{"date-parts":[[2024,1,16]]}}}