{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,21]],"date-time":"2026-02-21T03:47:47Z","timestamp":1771645667651,"version":"3.50.1"},"reference-count":50,"publisher":"Springer Science and Business Media LLC","issue":"4","license":[{"start":{"date-parts":[[2025,2,16]],"date-time":"2025-02-16T00:00:00Z","timestamp":1739664000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,2,16]],"date-time":"2025-02-16T00:00:00Z","timestamp":1739664000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Northeastern University USA"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Comput Vis"],"published-print":{"date-parts":[[2025,4]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>We have witnessed significant progress in human-object interaction (HOI) detection. However, relying solely on <jats:italic>mAP<\/jats:italic> (mean Average Precision) scores as a summary metric does not provide sufficient insight into the nuances of model performance (<jats:italic>e.g.<\/jats:italic>, why one model outperforms another), which can hinder further innovation in this field. To address this issue, we introduce a diagnosis toolbox in this paper to offer a detailed quantitative breakdown of HOI detection models, inspired by the success of object detection diagnosis tools. We first conduct a holistic investigation into the HOI detection pipeline. By defining a set of errors and using oracles to fix each one, we quantitatively analyze the significance of different errors based on the <jats:italic>mAP<\/jats:italic> improvement gained from fixing them. Next, we explore the two key sub-tasks of HOI detection: human-object pair localization and interaction classification. For the pair localization task, we compute the coverage of ground-truth human-object pairs and assess the noisiness of the localization results. For the classification task, we measure a model\u2019s ability to distinguish between positive and negative detection results and to classify actual interactions when human-object pairs are correctly localized. We analyze eight state-of-the-art HOI detection models, providing valuable diagnostic insights to guide future research. For instance, our diagnosis reveals that the state-of-the-art model RLIPv2 outperforms others primarily due to its significant improvement in multi-label interaction classification accuracy. 
Our toolbox is applicable across various methods and datasets and is available at <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"https:\/\/neu-vi.github.io\/Diag-HOI\/\" ext-link-type=\"uri\">https:\/\/neu-vi.github.io\/Diag-HOI\/<\/jats:ext-link>.<\/jats:p>","DOI":"10.1007\/s11263-025-02369-8","type":"journal-article","created":{"date-parts":[[2025,2,16]],"date-time":"2025-02-16T19:09:22Z","timestamp":1739732962000},"page":"2227-2244","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Diagnosing Human-Object Interaction Detectors"],"prefix":"10.1007","volume":"133","author":[{"ORCID":"https:\/\/orcid.org\/0009-0009-8610-8031","authenticated-orcid":false,"given":"Fangrui","family":"Zhu","sequence":"first","affiliation":[]},{"given":"Yiming","family":"Xie","sequence":"additional","affiliation":[]},{"given":"Weidi","family":"Xie","sequence":"additional","affiliation":[]},{"given":"Huaizu","family":"Jiang","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,2,16]]},"reference":[{"key":"2369_CR1","doi-asserted-by":"crossref","unstructured":"Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., & Zhang, L. (2018). Bottom-up and top-down attention for image captioning and visual question answering. In: IEEE Conference on Computer Vision and Pattern Recognition","DOI":"10.1109\/CVPR.2018.00636"},{"key":"2369_CR2","doi-asserted-by":"crossref","unstructured":"Aneja, J., Deshpande, A., & Schwing, A.G. (2018). Convolutional image captioning. In: IEEE Conference on Computer Vision and Pattern Recognition","DOI":"10.1109\/CVPR.2018.00583"},{"key":"2369_CR3","doi-asserted-by":"crossref","unstructured":"Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., & Parikh, D. (2015). Vqa: Visual question answering. In: International Conference on Computer Vision","DOI":"10.1109\/ICCV.2015.279"},{"key":"2369_CR4","doi-asserted-by":"crossref","unstructured":"Bolya, D., Foley, S., Hays, J., & Hoffman, J. (2020). TIDE: A general toolbox for identifying object detection errors. In: European Conference on Computer Vision","DOI":"10.1007\/978-3-030-58580-8_33"},{"key":"2369_CR5","doi-asserted-by":"crossref","unstructured":"Brown, A., Xie, W., Kalogeiton, V., & Zisserman, A. (2020). Smooth-ap: Smoothing the path towards large-scale image retrieval. In: European Conference on Computer Vision","DOI":"10.1007\/978-3-030-58545-7_39"},{"key":"2369_CR6","doi-asserted-by":"crossref","unstructured":"Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In: European Conference on Computer Vision","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"2369_CR7","doi-asserted-by":"crossref","unstructured":"Chao, Y.W., Liu, Y., Liu, X., Zeng, H., & Deng, J. (2018). Learning to detect human-object interactions. In: Winter Conference on Applications of Computer Vision","DOI":"10.1109\/WACV.2018.00048"},{"key":"2369_CR8","doi-asserted-by":"crossref","unstructured":"Chao, Y.W., Wang, Z., He, Y., Wang, J., & Deng, J. (2015). HICO: A benchmark for recognizing human-object interactions in images. In: International Conference on Computer Vision","DOI":"10.1109\/ICCV.2015.122"},{"key":"2369_CR9","unstructured":"Chen, J., & Yanai, K. (2021). QAHOI: Query-based anchors for human-object interaction detection. 
arXiv preprint arXiv:2112.08647"},{"key":"2369_CR10","unstructured":"Chen, S., Mettes, P., & Snoek, C.G. (2021). Diagnosing errors in video relation detectors. In: British Machine Vision Conference."},{"key":"2369_CR11","doi-asserted-by":"crossref","unstructured":"Feng, Y., Ma, L., Liu, W., & Luo, J. (2019). Unsupervised image captioning. In: IEEE Conference on Computer Vision and Pattern Recognition.","DOI":"10.1109\/CVPR.2019.00425"},{"key":"2369_CR12","doi-asserted-by":"crossref","unstructured":"Gao, C., Xu, J., Zou, Y., & Huang, J.B. (2020). DRG: Dual relation graph for human-object interaction detection. In: European Conference on Computer Vision","DOI":"10.1007\/978-3-030-58610-2_41"},{"key":"2369_CR13","unstructured":"Gupta, S., & Malik, J. (2015). Visual semantic role labeling. arXiv preprint arXiv:1505.04474"},{"key":"2369_CR14","doi-asserted-by":"crossref","unstructured":"Gupta, T., Schwing, A., & Hoiem, D. (2019). No-frills human-object interaction detection: Factorization, layout encodings, and training techniques. In: International Conference on Computer Vision","DOI":"10.1109\/ICCV.2019.00977"},{"key":"2369_CR15","doi-asserted-by":"crossref","unstructured":"Hoiem, D., Chodpathumwan, Y., & Dai, Q. (2012). Diagnosing error in object detectors. In: European Conference on Computer Vision","DOI":"10.1007\/978-3-642-33712-3_25"},{"key":"2369_CR16","doi-asserted-by":"crossref","unstructured":"Hou, Z., Yu, B., Qiao, Y., Peng, X., & Tao, D. (2021). Detecting human-object interaction via fabricated compositional learning. In: IEEE Conference on Computer Vision and Pattern Recognition","DOI":"10.1109\/CVPR46437.2021.01441"},{"key":"2369_CR17","doi-asserted-by":"crossref","unstructured":"Jiang, H., Ma, X., Nie, W., Yu, Z., Zhu, Y., & Anandkumar, A. (2022). Bongard-hoi: Benchmarking few-shot visual reasoning for human-object interactions. In: IEEE Conference on Computer Vision and Pattern Recognition","DOI":"10.1109\/CVPR52688.2022.01847"},{"key":"2369_CR18","doi-asserted-by":"crossref","unstructured":"Kilickaya, M., & Smeulders, A. (2020). Diagnosing rarity in human-object interaction detection. In: IEEE Conference on Computer Vision and Pattern Recognition","DOI":"10.1109\/CVPRW50498.2020.00460"},{"key":"2369_CR19","doi-asserted-by":"crossref","unstructured":"Kim, S., Jung, D., & Cho, M. (2023). Relational context learning for human-object interaction detection. In: IEEE Conference on Computer Vision and Pattern Recognition","DOI":"10.1109\/CVPR52729.2023.00286"},{"key":"2369_CR20","doi-asserted-by":"crossref","unstructured":"Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., et\u00a0al. (2017). Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision","DOI":"10.1007\/s11263-016-0981-7"},{"key":"2369_CR21","doi-asserted-by":"crossref","unstructured":"Li, G., Zhu, L., Liu, P., & Yang, Y. (2019). Entangled transformer for image captioning. In: International Conference on Computer Vision","DOI":"10.1109\/ICCV.2019.00902"},{"key":"2369_CR22","unstructured":"Li, Y.L., Fan, H., Qiu, Z., Dou, Y., Xu, L., Fang, H.S., Guo, P., Su, H., Wang, D., Wu, W., et\u00a0al. (2022). Discovering a variety of objects in spatio-temporal human-object interactions. arXiv preprint arXiv:2211.07501"},{"key":"2369_CR23","doi-asserted-by":"crossref","unstructured":"Liao, Y., Zhang, A., Lu, M., Wang, Y., Li, X., & Liu, S. (2022). 
Gen-vlkt: Simplify association and enhance interaction understanding for hoi detection. In: IEEE Conference on Computer Vision and Pattern Recognition","DOI":"10.1109\/CVPR52688.2022.01949"},{"key":"2369_CR24","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll\u00e1r, P., & Zitnick, C.L. (2014). Microsoft coco: Common objects in context. In: European Conference on Computer Vision","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"2369_CR25","doi-asserted-by":"crossref","unstructured":"Liu, X., Li, Y.L., & Lu, C. (2022). Highlighting object category immunity for the generalization of human-object interaction detection. In: Association for the Advancement of Artificial Intelligence","DOI":"10.1609\/aaai.v36i2.20075"},{"key":"2369_CR26","doi-asserted-by":"crossref","unstructured":"Liu, X., Li, Y.L., Wu, X., Tai, Y.W., Lu, C., & Tang, C.K. (2022). Interactiveness field in human-object interactions. In: IEEE Conference on Computer Vision and Pattern Recognition","DOI":"10.1109\/CVPR52688.2022.01948"},{"key":"2369_CR27","unstructured":"Lu, J., Yang, J., Batra, D., & Parikh, D. (2016). Hierarchical question-image co-attention for visual question answering. Conference on Neural Information Processing Systems"},{"key":"2369_CR28","doi-asserted-by":"crossref","unstructured":"Ma, S., Wang, Y., Wang, S., & Wei, Y. (2023). Fgahoi: Fine-grained anchors for human-object interaction detection. arXiv preprint arXiv:2301.04019","DOI":"10.1109\/TPAMI.2023.3331738"},{"key":"2369_CR29","doi-asserted-by":"crossref","unstructured":"Ng, T., Balntas, V., Tian, Y., & Mikolajczyk, K. (2020). Solar: second-order loss and attention for image retrieval. In: European Conference on Computer Vision","DOI":"10.1007\/978-3-030-58595-2_16"},{"key":"2369_CR30","doi-asserted-by":"crossref","unstructured":"Radenovi\u0107, F., Tolias, G., & Chum, O. (2018). Fine-tuning cnn image retrieval with no human annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence","DOI":"10.1109\/TPAMI.2018.2846566"},{"key":"2369_CR31","unstructured":"Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et\u00a0al. (2021). Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748\u20138763"},{"key":"2369_CR32","doi-asserted-by":"crossref","unstructured":"Shao, S., Li, Z., Zhang, T., Peng, C., Yu, G., Zhang, X., Li, J., & Sun, J. (2019). Objects365: A large-scale, high-quality dataset for object detection. In: International Conference on Computer Vision","DOI":"10.1109\/ICCV.2019.00852"},{"key":"2369_CR33","doi-asserted-by":"crossref","unstructured":"Shih, K.J., Singh, S., & Hoiem, D. (2016). Where to look: Focus regions for visual question answering. In: IEEE Conference on Computer Vision and Pattern Recognition","DOI":"10.1109\/CVPR.2016.499"},{"key":"2369_CR34","doi-asserted-by":"crossref","unstructured":"Tamura, M., Ohashi, H., & Yoshinaga, T. (2021). Qpic: Query-based pairwise human-object interaction detection with image-wide contextual information. In: IEEE Conference on Computer Vision and Pattern Recognition","DOI":"10.1109\/CVPR46437.2021.01027"},{"key":"2369_CR35","doi-asserted-by":"crossref","unstructured":"Teichmann, M., Araujo, A., Zhu, M., & Sim, J. (2019). Detect-to-retrieve: Efficient regional aggregation for image search. 
In: IEEE Conference on Computer Vision and Pattern Recognition","DOI":"10.1109\/CVPR.2019.00525"},{"key":"2369_CR36","doi-asserted-by":"crossref","unstructured":"Ulutan, O., Iftekhar, A., & Manjunath, B.S. (2020). VSGNet: Spatial attention network for detecting human object interactions using graph convolutions. In: IEEE Conference on Computer Vision and Pattern Recognition","DOI":"10.1109\/CVPR42600.2020.01363"},{"key":"2369_CR37","doi-asserted-by":"crossref","unstructured":"Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2016). Show and tell: Lessons learned from the 2015 mscoco image captioning challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence","DOI":"10.1109\/TPAMI.2016.2587640"},{"key":"2369_CR38","doi-asserted-by":"crossref","unstructured":"Wang, P., Wu, Q., Shen, C., Dick, A., & Van Den\u00a0Hengel, A. (2017). Fvqa: Fact-based visual question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence","DOI":"10.1109\/TPAMI.2017.2754246"},{"key":"2369_CR39","doi-asserted-by":"crossref","unstructured":"Wu, X., Li, Y.L., Liu, X., Zhang, J., Wu, Y., & Lu, C. (2022). Mining cross-person cues for body-part interactiveness learning in hoi detection. In: European Conference on Computer Vision","DOI":"10.1007\/978-3-031-19772-7_8"},{"key":"2369_CR40","doi-asserted-by":"crossref","unstructured":"Yu, Z., Huang, Y., Furuta, R., Yagi, T., Goutsu, Y., & Sato, Y. (2023). Fine-grained affordance annotation for egocentric hand-object interaction videos. In: Winter Conference on Applications of Computer Vision","DOI":"10.1109\/WACV56688.2023.00219"},{"key":"2369_CR41","unstructured":"Yuan, H., Jiang, J., Albanie, S., Feng, T., Huang, Z., Ni, D., & Tang, M. (2022). Rlip: Relational language-image pre-training for human-object interaction detection. In: Conference on Neural Information Processing Systems"},{"key":"2369_CR42","doi-asserted-by":"crossref","unstructured":"Yuan, H., Zhang, S., Wang, X., Albanie, S., Pan, Y., Feng, T., Jiang, J., Ni, D., Zhang, Y., & Zhao, D. (2023). Rlipv2: Fast scaling of relational language-image pre-training. In: International Conference on Computer Vision","DOI":"10.1109\/ICCV51070.2023.01979"},{"key":"2369_CR43","unstructured":"Zhang, A., Liao, Y., Liu, S., Lu, M., Wang, Y., Gao, C., & Li, X. (2021). Mining the benefits of two-stage and one-stage hoi detection. In: Advances in Neural Information Processing Systems"},{"key":"2369_CR44","doi-asserted-by":"crossref","unstructured":"Zhang, F.Z., Campbell, D., Gould, S. (2021). Spatially conditioned graphs for detecting human-object interactions. In: International Conference on Computer Vision","DOI":"10.1109\/ICCV48922.2021.01307"},{"key":"2369_CR45","doi-asserted-by":"crossref","unstructured":"Zhang, F.Z., Campbell, D., & Gould, S. (2022). Efficient two-stage detection of human-object interactions with a novel unary-pairwise transformer. In: IEEE Conference on Computer Vision and Pattern Recognition","DOI":"10.1109\/CVPR52688.2022.01947"},{"key":"2369_CR46","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Pan, Y., Yao, T., Huang, R., Mei, T., & Chen, C.W. (2022). Exploring structure-aware transformer over interaction proposals for human-object interaction detection. In: IEEE Conference on Computer Vision and Pattern Recognition","DOI":"10.1109\/CVPR52688.2022.01894"},{"key":"2369_CR47","doi-asserted-by":"crossref","unstructured":"Zhong, X., Ding, C., Li, Z., & Huang, S. (2022). Towards hard-positive query mining for detr-based human-object interaction detection. 
In: European Conference on Computer Vision","DOI":"10.1007\/978-3-031-19812-0_26"},{"key":"2369_CR48","doi-asserted-by":"crossref","unstructured":"Zhou, P., & Chi, M. (2019). Relation parsing neural network for human-object interaction detection. In: International Conference on Computer Vision","DOI":"10.1109\/ICCV.2019.00093"},{"key":"2369_CR49","unstructured":"Zhu, F., Yang, J., & Jiang, H. (2024). Towards flexible visual relationship segmentation. In: Conference on Neural Information Processing Systems"},{"key":"2369_CR50","doi-asserted-by":"crossref","unstructured":"Zou, C., Wang, B., Hu, Y., Liu, J., Wu, Q., Zhao, Y., Li, B., Zhang, C., Zhang, C., Wei, Y., et\u00a0al. (2021). End-to-end human object interaction detection with hoi transformer. In: IEEE Conference on Computer Vision and Pattern Recognition","DOI":"10.1109\/CVPR46437.2021.01165"}],"container-title":["International Journal of Computer Vision"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-025-02369-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11263-025-02369-8\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-025-02369-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,30]],"date-time":"2025-03-30T22:11:40Z","timestamp":1743372700000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11263-025-02369-8"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,2,16]]},"references-count":50,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2025,4]]}},"alternative-id":["2369"],"URL":"https:\/\/doi.org\/10.1007\/s11263-025-02369-8","relation":{},"ISSN":["0920-5691","1573-1405"],"issn-type":[{"value":"0920-5691","type":"print"},{"value":"1573-1405","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,2,16]]},"assertion":[{"value":"19 March 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"25 January 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"16 February 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}
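The abstract describes an oracle-style diagnosis: define error types, let an oracle fix one type at a time, and attribute significance to each error by the resulting mAP gain. The sketch below is a minimal, self-contained illustration of that idea, not the paper's toolbox (which is at https://neu-vi.github.io/Diag-HOI/): the data layout, the single hypothetical `fix_localization` oracle, its [0.1, 0.5) "near-miss" band, and the simplified single-class 11-point AP are all assumptions made for the example. The pair-matching rule (a detection is a true positive only when both its human and object boxes reach IoU >= 0.5 with an unmatched ground-truth pair) follows the standard HICO-DET evaluation protocol.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def average_precision(dets, gts, thr=0.5):
    """11-point AP for one interaction class. A detection is a true positive
    only if BOTH its human and object boxes overlap an unmatched ground-truth
    pair with IoU >= thr (the standard HICO-DET matching rule)."""
    dets = sorted(dets, key=lambda d: -d["score"])
    matched, hits = set(), []
    for d in dets:
        hit = next((i for i, g in enumerate(gts) if i not in matched
                    and iou(d["human"], g["human"]) >= thr
                    and iou(d["object"], g["object"]) >= thr), None)
        if hit is not None:
            matched.add(hit)
        hits.append(hit is not None)
    tp = np.cumsum(hits)
    recall = tp / max(len(gts), 1)
    precision = tp / np.arange(1, len(dets) + 1)
    return sum((precision[recall >= r].max() if (recall >= r).any() else 0.0)
               for r in np.linspace(0, 1, 11)) / 11

def fix_localization(dets, gts, thr=0.5):
    """Hypothetical localization oracle: snap near-miss pairs (both IoUs in
    [0.1, thr)) onto the ground truth, emulating 'fix one error, re-measure'."""
    fixed = []
    for d in dets:
        d = dict(d)  # shallow copy so the original detections stay untouched
        for g in gts:
            if (0.1 <= iou(d["human"], g["human"]) < thr
                    and 0.1 <= iou(d["object"], g["object"]) < thr):
                d["human"], d["object"] = g["human"], g["object"]
                break
        fixed.append(d)
    return fixed

# Toy scene: two ground-truth pairs; one detection is accurate, the other
# localizes both boxes poorly (a near miss the oracle can repair).
gts = [{"human": [0, 0, 50, 100], "object": [60, 20, 120, 80]},
       {"human": [200, 0, 250, 100], "object": [260, 20, 320, 80]}]
dets = [{"human": [2, 0, 52, 98], "object": [58, 22, 118, 78], "score": 0.9},
        {"human": [220, 40, 270, 140], "object": [280, 40, 340, 100], "score": 0.8}]

base = average_precision(dets, gts)
oracle = average_precision(fix_localization(dets, gts), gts)
print(f"AP {base:.3f} -> {oracle:.3f} (localization oracle gains {oracle - base:+.3f})")
```

Running this prints an AP jump from 0.545 to 1.000, so in this toy case the entire residual error is attributable to pair localization. The paper's toolbox generalizes the same bookkeeping to a full set of error types (and, per the abstract, to coverage/noisiness of pair localization and to multi-label interaction classification), reporting a per-oracle mAP delta for each model it diagnoses.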