{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,13]],"date-time":"2026-03-13T23:49:18Z","timestamp":1773445758757,"version":"3.50.1"},"reference-count":47,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2024,2,27]],"date-time":"2024-02-27T00:00:00Z","timestamp":1708992000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Natural Science Foundation of China","award":["U1903213"],"award-info":[{"award-number":["U1903213"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Entropy"],"abstract":"<jats:p>Human\u2013object interaction (HOI) detection aims to localize and recognize the relationship between humans and objects, which helps computers understand high-level semantics. In HOI detection, two-stage and one-stage methods have distinct advantages and disadvantages. The two-stage methods can obtain high-quality human\u2013object pair features based on object detection but lack contextual information. The one-stage transformer-based methods can model good global features but cannot benefit from object detection. The ideal model should have the advantages of both methods. Therefore, we propose the Pairwise Convolutional neural network (CNN)-Transformer (PCT), a simple and effective two-stage method. The model both fully utilizes the object detector and has rich contextual information. Specifically, we obtain pairwise CNN features from the CNN backbone. These features are fused with pairwise transformer features to enhance the pairwise representations. The enhanced representations are superior to using CNN and transformer features individually. In addition, the global features of the transformer provide valuable contextual cues. We fairly compare the performance of pairwise CNN and pairwise transformer features in HOI detection. 
The experimental results show that the previously neglected CNN features still have a significant edge. Compared to state-of-the-art methods, our model achieves competitive results on the HICO-DET and V-COCO datasets.<\/jats:p>","DOI":"10.3390\/e26030205","type":"journal-article","created":{"date-parts":[[2024,2,27]],"date-time":"2024-02-27T11:52:55Z","timestamp":1709034775000},"page":"205","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":8,"title":["Pairwise CNN-Transformer Features for Human\u2013Object Interaction Detection"],"prefix":"10.3390","volume":"26","author":[{"ORCID":"https:\/\/orcid.org\/0009-0007-2639-4772","authenticated-orcid":false,"given":"Hutuo","family":"Quan","sequence":"first","affiliation":[{"name":"College of Computer Science and Technology, Xinjiang University, Urumqi 830017, China"},{"name":"Xinjiang Key Laboratory of Signal Detection and Processing, Xinjiang University, Urumqi 830017, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4115-3244","authenticated-orcid":false,"given":"Huicheng","family":"Lai","sequence":"additional","affiliation":[{"name":"College of Computer Science and Technology, Xinjiang University, Urumqi 830017, China"},{"name":"Xinjiang Key Laboratory of Signal Detection and Processing, Xinjiang University, Urumqi 830017, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4059-9989","authenticated-orcid":false,"given":"Guxue","family":"Gao","sequence":"additional","affiliation":[{"name":"College of Computer Science and Technology, Xinjiang University, Urumqi 830017, China"},{"name":"Xinjiang Key Laboratory of Signal Detection and Processing, Xinjiang University, Urumqi 830017, China"}]},{"given":"Jun","family":"Ma","sequence":"additional","affiliation":[{"name":"College of Computer Science and Technology, Xinjiang University, Urumqi 830017, China"},{"name":"Xinjiang Key Laboratory of Signal Detection and Processing, Xinjiang University, Urumqi 830017, 
China"}]},{"given":"Junkai","family":"Li","sequence":"additional","affiliation":[{"name":"College of Computer Science and Technology, Xinjiang University, Urumqi 830017, China"},{"name":"Xinjiang Key Laboratory of Signal Detection and Processing, Xinjiang University, Urumqi 830017, China"}]},{"given":"Dongji","family":"Chen","sequence":"additional","affiliation":[{"name":"College of Computer Science and Technology, Xinjiang University, Urumqi 830017, China"},{"name":"Xinjiang Key Laboratory of Signal Detection and Processing, Xinjiang University, Urumqi 830017, China"}]}],"member":"1968","published-online":{"date-parts":[[2024,2,27]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Xiao, Y., Gao, G., Wang, L., and Lai, H. (2022). Optical flow-aware-based multi-modal fusion network for violence detection. Entropy, 24.","DOI":"10.3390\/e24070939"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Lv, J., Hui, T., Zhi, Y., and Xu, Y. (2023). Infrared Image Caption Based on Object-Oriented Attention. Entropy, 25.","DOI":"10.3390\/e25050826"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Wang, L., Yao, W., Chen, C., and Yang, H. (2022). Driving behavior recognition algorithm combining attention mechanism and lightweight network. Entropy, 24.","DOI":"10.3390\/e24070984"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"104617","DOI":"10.1016\/j.imavis.2022.104617","article-title":"Human object interaction detection: Design and survey","volume":"130","author":"Antoun","year":"2023","journal-title":"Image Vis. Comput."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Chao, Y.W., Liu, Y., Liu, X., Zeng, H., and Deng, J. (2018, January 12\u201315). Learning to detect human\u2013object interactions. Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA.","DOI":"10.1109\/WACV.2018.00048"},{"key":"ref_6","unstructured":"Gao, C., Zou, Y., and Huang, J.B. 
(2018, January 3\u20136). iCAN: Instance-Centric Attention Network for Human\u2013Object Interaction Detection. Proceedings of the British Machine Vision Conference (BMVC), Newcastle, UK."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Gkioxari, G., Girshick, R., Doll\u00e1r, P., and He, K. (2018, January 18\u201323). Detecting and recognizing human\u2013object interactions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2018, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00872"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Liao, Y., Liu, S., Wang, F., Chen, Y., Qian, C., and Feng, J. (2020, January 14\u201319). Ppdm: Parallel point detection and matching for real-time human\u2013object interaction detection. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00056"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Wang, T., Yang, T., Danelljan, M., Khan, F.S., Zhang, X., and Sun, J. (2020, January 14\u201319). Learning human\u2013object interaction detection using interaction points. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00417"},{"key":"ref_10","unstructured":"Kim, B., Choi, T., Kang, J., and Kim, H.J. (2020). Computer Vision\u2013ECCV 2020, Springer."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Tamura, M., Ohashi, H., and Yoshinaga, T. (2021, January 19\u201325). Qpic: Query-based pairwise human\u2013object interaction detection with image-wide contextual information. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition 2021, Virtual.","DOI":"10.1109\/CVPR46437.2021.01027"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Kim, B., Lee, J., Kang, J., Kim, E.S., and Kim, H.J. (2021, January 19\u201325). 
Hotr: End-to-end human\u2013object interaction detection with transformers. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Virtual.","DOI":"10.1109\/CVPR46437.2021.00014"},{"key":"ref_13","unstructured":"Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. (2020). Computer Vision\u2013ECCV 2020, Springer."},{"key":"ref_14","first-page":"1","article-title":"Attention is all you need","volume":"30","author":"Vaswani","year":"2017","journal-title":"Proc. Adv. Neural Inf. Process. Syst."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Doll\u00e1r, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017, January 21\u201326). Feature pyramid networks for object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.106"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Sun, K., Xiao, B., Liu, D., and Wang, J. (2019, January 15\u201320). Deep high-resolution representation learning for human pose estimation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition 2019, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00584"},{"key":"ref_17","first-page":"17209","article-title":"Mining the benefits of two-stage and one-stage hoi detection","volume":"34","author":"Zhang","year":"2021","journal-title":"Proc. Adv. Neural Inf. Process. Syst."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Ulutan, O., Iftekhar, A., and Manjunath, B.S. (2020, January 14\u201319). Vsgnet: Spatial attention network for detecting human object interactions using graph convolutions. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition 2020, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01363"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Sun, X., Hu, X., Ren, T., and Wu, G. (2020, January 8\u201311). 
Human object interaction detection via multi-level conditioned network. Proceedings of the 2020 International Conference on Multimedia Retrieval, Dublin, Ireland.","DOI":"10.1145\/3372278.3390671"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Zhang, F.Z., Campbell, D., and Gould, S. (2022, January 18\u201324). Efficient two-stage detection of human\u2013object interactions with a novel unary-pairwise transformer. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01947"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Zhang, F.Z., Campbell, D., and Gould, S. (2021, January 11\u201317). Spatially conditioned graphs for detecting human\u2013object interactions. Proceedings of the IEEE\/CVF International Conference on Computer Vision 2021, Virtual.","DOI":"10.1109\/ICCV48922.2021.01307"},{"key":"ref_22","first-page":"91","article-title":"Faster r-cnn: Towards real-time object detection with region proposal networks","volume":"Volume 28","author":"Cortes","year":"2015","journal-title":"Advances in Neural Information Processing Systems"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Chen, M., Liao, Y., Liu, S., Chen, Z., Wang, F., and Qian, C. (2021, January 19\u201325). Reformulating hoi detection as adaptive set prediction. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition 2021, Virtual.","DOI":"10.1109\/CVPR46437.2021.00889"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Qu, X., Ding, C., Li, X., Zhong, X., and Tao, D. (2022, January 18\u201324). Distillation using oracle queries for transformer-based human\u2013object interaction detection. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01895"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Liao, Y., Zhang, A., Lu, M., Wang, Y., Li, X., and Liu, S. (2022, January 18\u201324). Gen-vlkt: Simplify association and enhance interaction understanding for hoi detection. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01949"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Wang, G., Guo, Y., Wong, Y., and Kankanhalli, M. (2022, January 10\u201314). Distance Matters in Human\u2013Object Interaction Detection. Proceedings of the 30th ACM International Conference on Multimedia 2022, Lisboa, Portugal.","DOI":"10.1145\/3503161.3547793"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1109\/TIM.2021.3118090","article-title":"Multiscale feature interactive network for multifocus image fusion","volume":"70","author":"Liu","year":"2021","journal-title":"IEEE Trans. Instrum. Meas."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"6823","DOI":"10.1109\/TPAMI.2021.3094625","article-title":"Deep feature space: A geometrical perspective","volume":"44","author":"Kansizoglou","year":"2021","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_29","unstructured":"Gao, C., Xu, J., Zou, Y., and Huang, J.B. (2020). Computer Vision\u2013ECCV 2020, Springer."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Liang, Z., Liu, J., Guan, Y., and Rojas, J. (2021, January 27\u201331). Visual-semantic graph attention networks for human\u2013object interaction detection. 
Proceedings of the 2021 IEEE International Conference on Robotics and Biomimetics (ROBIO), Sanya, China.","DOI":"10.1109\/ROBIO54168.2021.9739429"},{"key":"ref_31","first-page":"3870","article-title":"Transferable Interactiveness Knowledge for Human\u2013Object Interaction Detection","volume":"44","author":"Li","year":"2022","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_32","unstructured":"Wu, X., Li, Y.L., Liu, X., Zhang, J., Wu, Y., and Lu, C. (2022). Computer Vision\u2013ECCV 2022, Springer."},{"key":"ref_33","unstructured":"Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., and Clark, J. (2021, July 18\u201324). Learning transferable visual models from natural language supervision. Proceedings of the International Conference on Machine Learning, PMLR, Virtual."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Pan, Y., Yao, T., Huang, R., Mei, T., and Chen, C.W. (2022, January 18\u201324). Exploring structure-aware transformer over interaction proposals for human\u2013object interaction detection. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01894"},{"key":"ref_35","unstructured":"DETR\u2019s Hands on Colab Notebook (2020, May 26). Facebook AI. Available online: https:\/\/github.com\/facebookresearch\/detr."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_37","unstructured":"Gupta, S., and Malik, J. (2015). Visual Semantic Role Labeling. arXiv."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Goyal, P., Girshick, R., He, K., and Doll\u00e1r, P. 
(2017, January 22\u201329). Focal loss for dense object detection. Proceedings of the IEEE International Conference on Computer Vision 2017, Venice, Italy.","DOI":"10.1109\/ICCV.2017.324"},{"key":"ref_39","unstructured":"Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll\u00e1r, P., and Zitnick, C.L. (2014). Computer Vision\u2013ECCV 2014, Springer."},{"key":"ref_40","unstructured":"Loshchilov, I., and Hutter, F. (2019, January 6\u20139). Decoupled Weight Decay Regularization. Proceedings of the International Conference on Learning Representations 2019, New Orleans, LA, USA."},{"key":"ref_41","unstructured":"Tu, D., Min, X., Duan, H., Guo, G., Zhai, G., and Shen, W. (2022). European Conference on Computer Vision, Springer."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"4495","DOI":"10.1007\/s10489-020-01794-1","article-title":"Multi-stream neural network fused with local information and global information for HOI detection","volume":"50","author":"Xia","year":"2020","journal-title":"Appl. Intell."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Zhu, L., Lan, Q., Velasquez, A., Song, H., Kamal, A., Tian, Q., and Niu, S. (2023). SKGHOI: Spatial-Semantic Knowledge Graph for Human\u2013Object Interaction Detection. arXiv.","DOI":"10.1109\/ICDMW60847.2023.00155"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Zou, C., Wang, B., Hu, Y., Liu, J., Wu, Q., Zhao, Y., Li, B., Zhang, C., Zhang, C., and Wei, Y. (2021, January 19\u201325). End-to-end human object interaction detection with hoi transformer. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition 2021, Virtual.","DOI":"10.1109\/CVPR46437.2021.01165"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Li, Z., Zou, C., Zhao, Y., Li, B., and Zhong, S. (March, January 22). Improving human\u2013object interaction detection via phrase learning and label composition. 
Proceedings of the AAAI Conference on Artificial Intelligence 2022, Online.","DOI":"10.1609\/aaai.v36i2.20041"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Kim, B., Mun, J., On, K.W., Shin, M., Lee, J., and Kim, E.S. (2022, January 18\u201324). Mstr: Multi-scale transformer for end-to-end human\u2013object interaction detection. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01897"},{"key":"ref_47","unstructured":"Peng, H., Liu, F., Li, Y., Huang, B., Shao, J., Sang, N., and Gao, C. (2023). Parallel Reasoning Network for Human\u2013Object Interaction Detection. arXiv."}],"container-title":["Entropy"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1099-4300\/26\/3\/205\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T14:05:40Z","timestamp":1760105140000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1099-4300\/26\/3\/205"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,2,27]]},"references-count":47,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2024,3]]}},"alternative-id":["e26030205"],"URL":"https:\/\/doi.org\/10.3390\/e26030205","relation":{},"ISSN":["1099-4300"],"issn-type":[{"value":"1099-4300","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,2,27]]}}}