{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,6]],"date-time":"2026-06-06T16:02:04Z","timestamp":1780761724420,"version":"3.54.1"},"reference-count":84,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2024,5,13]],"date-time":"2024-05-13T00:00:00Z","timestamp":1715558400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Interact. Mob. Wearable Ubiquitous Technol."],"published-print":{"date-parts":[[2024,5,13]]},"abstract":"<jats:p>Modern information querying systems are progressively incorporating multimodal inputs like vision and audio. However, the integration of gaze --- a modality deeply linked to user intent and increasingly accessible via gaze-tracking wearables --- remains underexplored. This paper introduces a novel gaze-facilitated information querying paradigm, named G-VOILA, which synergizes users' gaze, visual field, and voice-based natural language queries to facilitate a more intuitive querying process. In a user-enactment study involving 21 participants in 3 daily scenarios (p = 21, scene = 3), we revealed the ambiguity in users' query language and a gaze-voice coordination pattern in users' natural query behaviors with G-VOILA. Based on the quantitative and qualitative findings, we developed a design framework for the G-VOILA paradigm, which effectively integrates the gaze data with the in-situ querying context. Then we implemented a G-VOILA proof-of-concept using cutting-edge deep learning techniques. A follow-up user study (p = 16, scene = 2) demonstrates its effectiveness by achieving both higher objective score and subjective score, compared to a baseline without gaze data. We further conducted interviews and provided insights for future gaze-facilitated information querying systems.<\/jats:p>","DOI":"10.1145\/3659623","type":"journal-article","created":{"date-parts":[[2024,5,15]],"date-time":"2024-05-15T12:20:41Z","timestamp":1715775641000},"page":"1-33","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":21,"title":["G-VOILA: Gaze-Facilitated Information Querying in Daily Scenarios"],"prefix":"10.1145","volume":"8","author":[{"ORCID":"https:\/\/orcid.org\/0009-0007-5048-1665","authenticated-orcid":false,"given":"Zeyu","family":"Wang","sequence":"first","affiliation":[{"name":"Key Laboratory of Pervasive Computing, Ministry of Education, Department of Computer Science and Technology, Tsinghua University, Haidian Qu, Beijing Shi, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2273-6927","authenticated-orcid":false,"given":"Yuanchun","family":"Shi","sequence":"additional","affiliation":[{"name":"Key Laboratory of Pervasive Computing, Ministry of Education, Department of Computer Science and Technology, Tsinghua University, Haidian Qu, Beijing Shi, China and Intelligent Computing and Application Laboratory of Qinghai Province, Qinghai University, Xining, Qinghai, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4249-8893","authenticated-orcid":false,"given":"Yuntao","family":"Wang","sequence":"additional","affiliation":[{"name":"Key Laboratory of Pervasive Computing, Ministry of Education, Department of Computer Science and Technology, Tsinghua University, Haidian Qu, Beijing Shi, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-7954-7372","authenticated-orcid":false,"given":"Yuchen","family":"Yao","sequence":"additional","affiliation":[{"name":"Tsinghua University, Haidian Qu, Beijing Shi, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8290-5169","authenticated-orcid":false,"given":"Kun","family":"Yan","sequence":"additional","affiliation":[{"name":"Microsoft Research Asia, Haidian Qu, Beijing Shi, China and SKLSDE Lab, Beihang University, Haidian Qu, Beijing Shi, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-5139-1267","authenticated-orcid":false,"given":"Yuhan","family":"Wang","sequence":"additional","affiliation":[{"name":"Beijing University of Posts and Telecommunications, Haidian Qu, Beijing Shi, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7569-3265","authenticated-orcid":false,"given":"Lei","family":"Ji","sequence":"additional","affiliation":[{"name":"Microsoft Research Asia, Haidian Qu, Beijing Shi, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5930-3899","authenticated-orcid":false,"given":"Xuhai","family":"Xu","sequence":"additional","affiliation":[{"name":"Massachusetts Institute of Technology, Cambridge, Massachusetts, United States"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2591-7993","authenticated-orcid":false,"given":"Chun","family":"Yu","sequence":"additional","affiliation":[{"name":"Tsinghua University, Haidian Qu, Beijing Shi, China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2024,5,15]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"[n. d.]. Microsoft Bing. https:\/\/www.bing.com\/. Accessed: 2023-09-07."},{"key":"e_1_2_1_2_1","volume-title":"2016 AAAI Fall Symposium Series.","author":"Admoni Henny","year":"2016","unstructured":"Henny Admoni and Siddhartha Srinivasa. 2016. Predicting user intent through eye gaze for shared autonomy. In 2016 AAAI Fall Symposium Series."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/MLSP.2010.5589228"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11257-009-9066-4"},{"key":"e_1_2_1_5_1","first-page":"23716","article-title":"Flamingo: a visual language model for few-shot learning","volume":"35","author":"Alayrac Jean-Baptiste","year":"2022","unstructured":"Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems 35 (2022), 23716--23736.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_6_1","unstructured":"Alibaba. 2023. Tongyi Qianwen. (2023)."},{"key":"e_1_2_1_7_1","volume-title":"Acm sigir forum","author":"Allan James","unstructured":"James Allan, Bruce Croft, Alistair Moffat, and Mark Sanderson. 2012. Frontiers, challenges, and opportunities for information retrieval: Report from SWIRL 2012 the second strategic workshop on information retrieval in Lorne. In Acm sigir forum, Vol. 46. ACM New York, NY, USA, 2--32."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3025453.3026033"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.5281\/zenodo.7733589"},{"key":"e_1_2_1_10_1","unstructured":"Ricardo Baeza-Yates Berthier Ribeiro-Neto et al. 1999. Modern information retrieval. Vol. 463. ACM press New York."},{"key":"e_1_2_1_11_1","unstructured":"Baidu. 2023. ERNIE Bot: Enhanced Representation through Knowledge Integration. (2023)."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/800250.807503"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/1620545.1620552"},{"key":"e_1_2_1_14_1","volume-title":"Demonstrating Reality-Based Information Retrieval. In Extended Abstracts of the 2018 CHI Conference on Human Factors in Computing Systems. 1--4.","author":"B\u00fcschel Wolfgang","year":"2018","unstructured":"Wolfgang B\u00fcschel, Annett Mitschick, and Raimund Dachselt. 2018. Demonstrating Reality-Based Information Retrieval. In Extended Abstracts of the 2018 CHI Conference on Human Factors in Computing Systems. 1--4."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/3176349.3176384"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3442381.3450127"},{"key":"e_1_2_1_17_1","volume-title":"Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic. arXiv preprint arXiv:2306.15195","author":"Chen Keqin","year":"2023","unstructured":"Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. 2023. Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic. arXiv preprint arXiv:2306.15195 (2023)."},{"key":"e_1_2_1_18_1","volume-title":"ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models. https:\/\/openreview.net\/forum?id=kdHpWogtX6Y","author":"Chen Liangyu","year":"2023","unstructured":"Liangyu Chen, Bo Li, Sheng Shen, Jingkang Yang, Chunyuan Li, Kurt Keutzer, Trevor Darrell, and Ziwei Liu. 2023. Language Models are Visual Reasoning Coordinators. In ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models. https:\/\/openreview.net\/forum?id=kdHpWogtX6Y"},{"key":"e_1_2_1_19_1","volume-title":"Xing","author":"Chiang Wei-Lin","year":"2023","unstructured":"Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https:\/\/vicuna.lmsys.org"},{"key":"e_1_2_1_20_1","volume-title":"International Conference on Machine Learning. PMLR","author":"Cho Jaemin","year":"2021","unstructured":"Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal. 2021. Unifying vision-and-language tasks via text generation. In International Conference on Machine Learning. PMLR, 1931--1942."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/2858036.2858499"},{"key":"e_1_2_1_22_1","volume-title":"The promise of immersive learning: Augmented and virtual reality's potential in education","author":"Dick Ellysse","year":"2021","unstructured":"Ellysse Dick. 2021. The promise of immersive learning: Augmented and virtual reality's potential in education. Information Technology and Innovation Foundation (2021)."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3544549.3585790"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/ivs.2009.5164397"},{"key":"e_1_2_1_25_1","volume-title":"Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al.","author":"Driess Danny","year":"2023","unstructured":"Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. 2023. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023)."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3544549.3585853"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3544549.3585853"},{"key":"e_1_2_1_28_1","unstructured":"Peng Gao Jiaming Han Renrui Zhang Ziyi Lin Shijie Geng Aojun Zhou Wei Zhang Pan Lu Conghui He Xiangyu Yue et al. 2023. LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model. arXiv preprint arXiv:2304.15010 (2023)."},{"key":"e_1_2_1_29_1","first-page":"1","article-title":"MMTSA: Multi-Modal Temporal Segment Attention Network for Efficient Human Activity Recognition","volume":"7","author":"Gao Ziqi","year":"2023","unstructured":"Ziqi Gao, Yuntao Wang, Jianguo Chen, Junliang Xing, Shwetak Patel, Xin Liu, and Yuanchun Shi. 2023. MMTSA: Multi-Modal Temporal Segment Attention Network for Efficient Human Activity Recognition. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 7, 3 (2023), 1--26.","journal-title":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.18637\/jss.v031.i07"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.3389\/fpsyg.2015.01049"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298710"},{"key":"e_1_2_1_33_1","volume-title":"Damien Jose, and Xiang Ren.","author":"Jin Woojeong","year":"2023","unstructured":"Woojeong Jin, Subhabrata Mukherjee, Yu Cheng, Yelong Shen, Weizhu Chen, Ahmed Hassan Awadallah, Damien Jose, and Xiang Ren. 2023. GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions. arXiv preprint arXiv:2305.14676 (2023)."},{"key":"e_1_2_1_34_1","doi-asserted-by":"crossref","unstructured":"Alexander Kirillov Eric Mintun Nikhila Ravi Hanzi Mao Chloe Rolland Laura Gustafson Tete Xiao Spencer Whitehead Alexander C. Berg Wan-Yen Lo Piotr Doll\u00e1r and Ross Girshick. 2023. Segment Anything. arXiv:2304.02643 [cs.CV]","DOI":"10.1109\/ICCV51070.2023.00371"},{"key":"e_1_2_1_35_1","volume-title":"Eye movements and the control of actions in everyday life. Progress in retinal and eye research 25, 3","author":"Land Michael F","year":"2006","unstructured":"Michael F Land. 2006. Eye movements and the control of actions in everyday life. Progress in retinal and eye research 25, 3 (2006), 296--324."},{"key":"e_1_2_1_36_1","volume-title":"Otter: A Multi-Modal Model with In-Context Instruction Tuning. arXiv preprint arXiv:2305.03726","author":"Li Bo","year":"2023","unstructured":"Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. 2023. Otter: A Multi-Modal Model with In-Context Instruction Tuning. arXiv preprint arXiv:2305.03726 (2023)."},{"key":"e_1_2_1_37_1","volume-title":"Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597","author":"Li Junnan","year":"2023","unstructured":"Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-981-16-7213-2_8"},{"key":"e_1_2_1_39_1","volume-title":"Labeling out-of-view objects in immersive analytics to support situated visual searching","author":"Lin Tica","year":"2021","unstructured":"Tica Lin, Yalong Yang, Johanna Beyer, and Hanspeter Pfister. 2021. Labeling out-of-view objects in immersive analytics to support situated visual searching. IEEE Transactions on Visualization and Computer Graphics (2021)."},{"key":"e_1_2_1_40_1","volume-title":"Visual instruction tuning. arXiv preprint arXiv:2304.08485","author":"Liu Haotian","year":"2023","unstructured":"Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. arXiv preprint arXiv:2304.08485 (2023)."},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/3631429"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2022.04.080"},{"key":"e_1_2_1_43_1","unstructured":"J\u00e9r\u00f4me Louradour. 2023. whisper-timestamped. https:\/\/github.com\/linto-ai\/whisper-timestamped."},{"key":"e_1_2_1_44_1","volume-title":"Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby.","author":"Minderer Matthias","year":"2022","unstructured":"Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. 2022. Simple Open-Vocabulary Object Detection with Vision Transformers. arXiv:2205.06230 [cs.CV]"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-29384-0_17"},{"key":"e_1_2_1_46_1","unstructured":"OpenAI. 2023. GPT-4 Technical Report. (2023)."},{"key":"e_1_2_1_47_1","unstructured":"OpenAI. 2023. Introducing ChatGPT. (2023)."},{"key":"e_1_2_1_48_1","unstructured":"Zhiliang Peng Wenhui Wang Li Dong Yaru Hao Shaohan Huang Shuming Ma and Furu Wei. 2023. Kosmos-2: Grounding Multimodal Large Language Models to the World. arXiv:2306.14824 [cs.CL]"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.85"},{"key":"e_1_2_1_50_1","volume-title":"Information Systems and Neuroscience: NeuroIS Retreat","author":"Perkhofer Lisa","year":"2018","unstructured":"Lisa Perkhofer and Othmar Lehner. 2019. Using gaze behavior to measure cognitive load. In Information Systems and Neuroscience: NeuroIS Retreat 2018. Springer, 73--83."},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/2807442.2807460"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-85623-6_32"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/3491207"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58558-7_38"},{"key":"e_1_2_1_55_1","volume-title":"International Conference on Machine Learning. PMLR, 28492--28518","author":"Radford Alec","year":"2023","unstructured":"Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning. PMLR, 28492--28518."},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1147\/sj.393.0685"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2020.01.028"},{"key":"e_1_2_1_58_1","unstructured":"SenseTime. 2023. Sense Nova. (2023)."},{"key":"e_1_2_1_59_1","volume-title":"Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. arXiv preprint arXiv:2303.17580","author":"Shen Yongliang","year":"2023","unstructured":"Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. arXiv preprint arXiv:2303.17580 (2023)."},{"key":"e_1_2_1_60_1","volume-title":"Co-Saliency Detection and Video Salient Object Detection","author":"Su Yukun","year":"2023","unstructured":"Yukun Su, Jingliang Deng, Ruizhou Sun, Guosheng Lin, and Qingyao Wu. 2023. A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection. IEEE Transactions on Multimedia (2023)."},{"key":"e_1_2_1_61_1","volume-title":"Vipergpt: Visual inference via python execution for reasoning. arXiv preprint arXiv:2303.08128","author":"Sur\u00eds D\u00eddac","year":"2023","unstructured":"D\u00eddac Sur\u00eds, Sachit Menon, and Carl Vondrick. 2023. Vipergpt: Visual inference via python execution for reasoning. arXiv preprint arXiv:2303.08128 (2023)."},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1145\/332040.332443"},{"key":"e_1_2_1_63_1","volume-title":"Hashimoto","author":"Taori Rohan","year":"2023","unstructured":"Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https:\/\/github.com\/tatsu-lab\/stanford_alpaca."},{"key":"e_1_2_1_64_1","volume-title":"Chris Kay Baumann, and Kai Dierkes","author":"Tonsen Marc","year":"2020","unstructured":"Marc Tonsen, Chris Kay Baumann, and Kai Dierkes. 2020. A High-Level Description and Performance Evaluation of Pupil Invisible. arXiv preprint arXiv:2009.00508 (2020)."},{"key":"e_1_2_1_65_1","volume-title":"LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971","author":"Touvron Hugo","year":"2023","unstructured":"Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth\u00e9e Lacroix, Baptiste Rozi\u00e8re, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971 (2023)."},{"key":"e_1_2_1_66_1","volume-title":"International Conference on Machine Learning. PMLR, 23318--23340","author":"Wang Peng","year":"2022","unstructured":"Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. 2022. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning. PMLR, 23318--23340."},{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1145\/3544548.3581425"},{"key":"e_1_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1145\/3491102.3517698"},{"key":"e_1_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.1145\/3544548.3581042"},{"key":"e_1_2_1_70_1","volume-title":"Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671","author":"Wu Chenfei","year":"2023","unstructured":"Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. 2023. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671 (2023)."},{"key":"e_1_2_1_71_1","doi-asserted-by":"publisher","DOI":"10.1145\/3544548.3581500"},{"key":"e_1_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.1145\/3381011"},{"key":"e_1_2_1_73_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.acl-long.157"},{"key":"e_1_2_1_74_1","unstructured":"Kun Yan Lei Ji Zeyu Wang Yuntao Wang Nan Duan and Shuai Ma. 2023. Voila-A: Aligning Vision-Language Models with User's Gaze Attention. arXiv:2401.09454 [cs.CV]"},{"key":"e_1_2_1_75_1","doi-asserted-by":"publisher","DOI":"10.1109\/TVCG.2023.3247085"},{"key":"e_1_2_1_76_1","doi-asserted-by":"publisher","DOI":"10.1145\/3488560.3502194"},{"key":"e_1_2_1_77_1","volume-title":"Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381","author":"Yang Zhengyuan","year":"2023","unstructured":"Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. 2023. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381 (2023)."},{"key":"e_1_2_1_78_1","volume-title":"PEVL: Position-enhanced pre-training and prompt tuning for vision-language models. arXiv preprint arXiv:2205.11169","author":"Yao Yuan","year":"2022","unstructured":"Yuan Yao, Qianyu Chen, Ao Zhang, Wei Ji, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. 2022. PEVL: Position-enhanced pre-training and prompt tuning for vision-language models. arXiv preprint arXiv:2205.11169 (2022)."},{"key":"e_1_2_1_79_1","unstructured":"Belinda Zeng. 2022. Go beyond the search box: Introducing multisearch. https:\/\/blog.google\/products\/search\/multisearch\/"},{"key":"e_1_2_1_80_1","volume-title":"Gradient-Induced Co-Saliency Detection. In European Conference on Computer Vision (ECCV).","author":"Zhang Zhao","year":"2020","unstructured":"Zhao Zhang, Wenda Jin, Jun Xu, and Ming-Ming Cheng. 2020. Gradient-Induced Co-Saliency Detection. In European Conference on Computer Vision (ECCV)."},{"key":"e_1_2_1_81_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01629"},{"key":"e_1_2_1_82_1","volume-title":"RegionBLIP: A Unified Multi-modal Pre-training Framework for Holistic and Regional Comprehension. arXiv preprint arXiv:2308.02299","author":"Zhou Qiang","year":"2023","unstructured":"Qiang Zhou, Chaohui Yu, Shaofeng Zhang, Sitong Wu, Zhibing Wang, and Fan Wang. 2023. RegionBLIP: A Unified Multi-modal Pre-training Framework for Holistic and Regional Comprehension. arXiv preprint arXiv:2308.02299 (2023)."},{"key":"e_1_2_1_83_1","volume-title":"MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv preprint arXiv:2304.10592","author":"Zhu Deyao","year":"2023","unstructured":"Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv preprint arXiv:2304.10592 (2023)."},{"key":"e_1_2_1_84_1","unstructured":"Xueyan Zou Zi-Yi Dou Jianwei Yang Zhe Gan Linjie Li Chunyuan Li Xiyang Dai Harkirat Behl Jianfeng Wang Lu Yuan et al. 2022. Generalized Decoding for Pixel Image and Language. arXiv preprint arXiv:2212.11270 (2022)."}],"container-title":["Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3659623","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3659623","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T17:01:31Z","timestamp":1755882091000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3659623"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,5,13]]},"references-count":84,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2024,5,13]]}},"alternative-id":["10.1145\/3659623"],"URL":"https:\/\/doi.org\/10.1145\/3659623","relation":{},"ISSN":["2474-9567"],"issn-type":[{"value":"2474-9567","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,5,13]]},"assertion":[{"value":"2024-05-15","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}