{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T05:04:31Z","timestamp":1750309471682,"version":"3.41.0"},"reference-count":22,"publisher":"Association for Computing Machinery (ACM)","issue":"ISS","license":[{"start":{"date-parts":[[2024,10,24]],"date-time":"2024-10-24T00:00:00Z","timestamp":1729728000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Hum.-Comput. Interact."],"published-print":{"date-parts":[[2024,10,24]]},"abstract":"<jats:p>This paper aims to investigate the problem of gaze object prediction in single images. We propose an application-friendly network based on CLIP for gaze object prediction. First, to avoid domain bias, we utilize a shallow feature adapter that transfers pre-trained features to target-oriented ones. Second, we introduce a pooling attention block to exploit the joint representation of multimodal elements, reducing gaze point deviation. Additionally, we introduce a loss that measures the prediction quality by comparing the distribution difference between the model's predicted heatmaps and the ground truth. Extensive experiments demonstrate the superior performance of our model compared to previous models. 
We will provide the method code at: https:\/\/github.com\/fadaishaitaiyang\/CCLIP.git.<\/jats:p>","DOI":"10.1145\/3698132","type":"journal-article","created":{"date-parts":[[2024,10,24]],"date-time":"2024-10-24T19:23:51Z","timestamp":1729797831000},"page":"155-164","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Towards Adapting CLIP for Gaze Object Prediction"],"prefix":"10.1145","volume":"8","author":[{"ORCID":"https:\/\/orcid.org\/0009-0006-0783-6881","authenticated-orcid":false,"given":"Dazhi","family":"Chen","sequence":"first","affiliation":[{"name":"Guizhou University, Guiyang, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-2055-2157","authenticated-orcid":false,"given":"Gang","family":"Gou","sequence":"additional","affiliation":[{"name":"Guizhou University, Guiyang, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,10,24]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413683"},{"key":"e_1_2_1_2_1","volume-title":"-Y. M. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934","author":"Bochkovskiy A.","year":"2020","unstructured":"Bochkovskiy, A., Wang, C.-Y., and Liao, H.-Y. M. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020)."},{"key":"e_1_2_1_3_1","first-page":"466","volume-title":"International Conference on Neural Information Processing","author":"Chen D.","year":"2023","unstructured":"Chen, D., and Gou, G. Unleash the capabilities of the vision-language pre-training model in gaze object prediction. In International Conference on Neural Information Processing (2023), Springer, pp. 
453\u2013466."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00544"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3206343.3206351"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.169"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2014.81"},{"key":"e_1_2_1_8_1","first-page":"199","volume-title":"Current trends in eye tracking research","author":"Harwood T.","year":"2013","unstructured":"Harwood, T., and Jones, M. Mobile eye-tracking in retail research. In Current trends in eye tracking research. Springer, 2013, pp. 183\u2013199."},{"key":"e_1_2_1_9_1","first-page":"50","volume-title":"Perth, Australia, December 2\u20136","author":"Lian D.","year":"2018","unstructured":"Lian, D., Yu, Z., and Gao, S. Believe it or not, we know what you are looking at! In Computer Vision\u2013ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2\u20136, 2018, Revised Selected Papers, Part III 14 (2019), Springer, pp. 35\u201350."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00381"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2015.2482819"},{"key":"e_1_2_1_12_1","volume-title":"Augmented saliency model using automatic 3d head pose detection and learned gaze following in natural scenes. Vision research 116","author":"Parks D.","year":"2015","unstructured":"Parks, D., Borji, A., and Itti, L. Augmented saliency model using automatic 3d head pose detection and learned gaze following in natural scenes. Vision research 116 (2015), 113\u2013126."},{"key":"e_1_2_1_13_1","first-page":"8763","volume-title":"International conference on machine learning","author":"Radford A.","year":"2021","unstructured":"Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. 
Learning transferable visual models from natural language supervision. In International conference on machine learning (2021), PMLR, pp. 8748\u20138763."},{"key":"e_1_2_1_14_1","volume-title":"Where are they looking? Advances in neural information processing systems 28","author":"Recasens A.","year":"2015","unstructured":"Recasens, A., Khosla, A., Vondrick, C., and Torralba, A. Where are they looking? Advances in neural information processing systems 28 (2015)."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.91"},{"key":"e_1_2_1_16_1","volume-title":"Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems 28","author":"Ren S.","year":"2015","unstructured":"Ren, S., He, K., Girshick, R., and Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems 28 (2015)."},{"key":"e_1_2_1_17_1","first-page":"26","article-title":"Action is in the eye of the beholder: Eye-gaze driven model for spatio-temporal action localization","author":"Shapovalova N.","year":"2013","unstructured":"Shapovalova, N., Raptis, M., Sigal, L., and Mori, G. Action is in the eye of the beholder: Eye-gaze driven model for spatio-temporal action localization. Advances in Neural Information Processing Systems 26 (2013).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW53098.2021.00349"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3536221.3556624"},{"key":"e_1_2_1_20_1","volume-title":"Joint gaze-location and gaze-object detection. arXiv preprint arXiv:2308.13857","author":"Tu D.","year":"2023","unstructured":"Tu, D., Shen, W., Sun, W., Min, X., and Zhai, G. Joint gaze-location and gaze-object detection. 
arXiv preprint arXiv:2308.13857 (2023)."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01898"},{"key":"e_1_2_1_22_1","first-page":"6","article-title":"Interact as you intend: Intention-driven human-object interaction detection","volume":"22","author":"Xu B.","year":"2019","unstructured":"Xu, B., Li, J., Wong, Y., Zhao, Q., and Kankanhalli, M. S. Interact as you intend: Intention-driven human-object interaction detection. IEEE Transactions on Multimedia 22, 6 (2019), 1423\u20131432.","journal-title":"IEEE Transactions on Multimedia"}],"container-title":["Proceedings of the ACM on Human-Computer Interaction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3698132","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3698132","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:17:17Z","timestamp":1750295837000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3698132"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,10,24]]},"references-count":22,"journal-issue":{"issue":"ISS","published-print":{"date-parts":[[2024,10,24]]}},"alternative-id":["10.1145\/3698132"],"URL":"https:\/\/doi.org\/10.1145\/3698132","relation":{},"ISSN":["2573-0142"],"issn-type":[{"type":"electronic","value":"2573-0142"}],"subject":[],"published":{"date-parts":[[2024,10,24]]},"assertion":[{"value":"2024-10-24","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}