{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,2]],"date-time":"2025-12-02T19:57:02Z","timestamp":1764705422616,"version":"3.46.0"},"reference-count":88,"publisher":"Association for Computing Machinery (ACM)","issue":"4","funder":[{"DOI":"10.13039\/100031278","name":"Department for Science, Innovation and Technology","doi-asserted-by":"crossref","award":["K250071-101"],"award-info":[{"award-number":["K250071-101"]}],"id":[{"id":"10.13039\/100031278","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Jiangsu Basic Research Program","award":["BK20240414"],"award-info":[{"award-number":["BK20240414"]}]},{"name":"Suzhou Industrial Park Leadership Talent Program","award":["KJQ2024204"],"award-info":[{"award-number":["KJQ2024204"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Interact. Mob. Wearable Ubiquitous Technol."],"published-print":{"date-parts":[[2025,12,2]]},"abstract":"<jats:p>\n                    Integrating vision-language models (VLMs) with wearable devices offers great potential for continuous and responsive video understanding, a key capability for applications such as smart eyewear-based conversational assistants. However, achieving this on resource-constrained devices is challenging due to the high energy demands of continuous spatial-temporal sampling and transmission. We propose\n                    <jats:italic toggle=\"yes\">ActiveEye<\/jats:italic>\n                    , a VLM designed for energy-efficient and responsive video understanding.\n                    <jats:italic toggle=\"yes\">ActiveEye<\/jats:italic>\n                    separates visual and motion semantic representations and incorporates an active perception-based feedback path to adaptively adjust spatial-temporal sampling and transmission rates. Implemented as a wearable-mobile-cloud system,\n                    <jats:italic toggle=\"yes\">ActiveEye<\/jats:italic>\n                    is evaluated for energy efficiency, real-time semantic change detection, and video understanding in both laboratory and field studies. Using the EgoSchema dataset,\n                    <jats:italic toggle=\"yes\">ActiveEye<\/jats:italic>\n                    reduces the front-end energy consumption by 49.14%, supporting 8.37 hours of continuous operation on a 2.1 Wh battery. It achieves the highest F1 score (0.80) and the lowest average time difference (1.30 s) compared with heuristic-based event detection algorithms, validating its timely semantic detection. Furthermore,\n                    <jats:italic toggle=\"yes\">ActiveEye<\/jats:italic>\n                    achieves a visual question answering (VQA) accuracy of 61.6%, which is comparable to state-of-the-art VLM agents, despite their reliance on larger language decoders and more computationally intensive frame selection strategies. 
Two rounds of in-field user evaluations further confirm its effectiveness in real-world settings, demonstrating its practical viability as a continuous and responsive video understanding system, conversational assistant, and wearable companion.\n                  <\/jats:p>","DOI":"10.1145\/3770641","type":"journal-article","created":{"date-parts":[[2025,12,2]],"date-time":"2025-12-02T19:42:32Z","timestamp":1764704552000},"page":"1-33","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["ActiveEye: Enabling Continuous and Responsive Video Understanding for Smart Eyewear Systems"],"prefix":"10.1145","volume":"9","author":[{"ORCID":"https:\/\/orcid.org\/0009-0008-4497-724X","authenticated-orcid":false,"given":"Zhenyu","family":"Xu","sequence":"first","affiliation":[{"name":"School of Computer Science, Fudan University, Shanghai, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-0540-2888","authenticated-orcid":false,"given":"Tianlin","family":"Lu","sequence":"additional","affiliation":[{"name":"School of Computer Science, Fudan University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5902-1306","authenticated-orcid":false,"given":"Yingying","family":"Zhao","sequence":"additional","affiliation":[{"name":"Department of Computer and Information Sciences, University of Strathclyde, Glasgow, United Kingdom"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6220-029X","authenticated-orcid":false,"given":"Yujiang","family":"Wang","sequence":"additional","affiliation":[{"name":"Oxford Suzhou Centre for Advanced Research, Suzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8897-5931","authenticated-orcid":false,"given":"Mingzhi","family":"Dong","sequence":"additional","affiliation":[{"name":"Department of Computer Science, University of Bath, Bath, United Kingdom"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2607-916X","authenticated-orcid":false,"given":"Yuhu","family":"Chang","sequence":"additional","affiliation":[{"name":"School of Computer Science, Fudan University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9437-1376","authenticated-orcid":false,"given":"Qin","family":"Lv","sequence":"additional","affiliation":[{"name":"Department of Computer Science, University of Colorado Boulder, Boulder, Colorado, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5428-9530","authenticated-orcid":false,"given":"Robert P.","family":"Dick","sequence":"additional","affiliation":[{"name":"Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2164-8175","authenticated-orcid":false,"given":"Fan","family":"Yang","sequence":"additional","affiliation":[{"name":"School of Microelectronics, Fudan University, Shanghai, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6633-4826","authenticated-orcid":false,"given":"Tun","family":"Lu","sequence":"additional","affiliation":[{"name":"School of Computer Science, Fudan University, Shanghai, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2915-974X","authenticated-orcid":false,"given":"Ning","family":"Gu","sequence":"additional","affiliation":[{"name":"School of Computer Science, Fudan University, Shanghai, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3944-7531","authenticated-orcid":false,"given":"Li","family":"Shang","sequence":"additional","affiliation":[{"name":"School of Computer Science, Fudan University, Shanghai, 
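The record contains no code, but the abstract's core mechanism (an active perception feedback path that samples densely while scene semantics are changing and sparsely while they are static) can be illustrated with a short sketch. Everything below is hypothetical: the change metric, thresholds, and rate bounds are invented for illustration and are not taken from the paper.

# Hypothetical sketch of adaptive temporal sampling driven by semantic change.
# Not from the ActiveEye paper: the metric, thresholds, and bounds are placeholders.
import math
from dataclasses import dataclass

@dataclass
class SamplerState:
    fps: float           # current temporal sampling rate (frames per second)
    min_fps: float = 0.5
    max_fps: float = 10.0

def semantic_change(prev_feat, cur_feat):
    """Cosine distance between consecutive frame embeddings (placeholder metric)."""
    dot = sum(a * b for a, b in zip(prev_feat, cur_feat))
    na = math.sqrt(sum(a * a for a in prev_feat))
    nb = math.sqrt(sum(b * b for b in cur_feat))
    return 1.0 - dot / (na * nb + 1e-9)

def update_rate(state, change, hi=0.3, lo=0.05):
    """Feedback step: sample faster when the scene changes, slower when static."""
    if change > hi:
        state.fps = min(state.fps * 2.0, state.max_fps)  # ramp up on change
    elif change < lo:
        state.fps = max(state.fps * 0.5, state.min_fps)  # back off when static
    return state

# Example: a burst of change doubles the rate; a static stretch halves it.
s = SamplerState(fps=2.0)
s = update_rate(s, change=0.40)  # fps -> 4.0
s = update_rate(s, change=0.01)  # fps -> 2.0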
China"}]}],"member":"320","published-online":{"date-parts":[[2025,12,2]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00129"},{"key":"e_1_2_1_2_1","unstructured":"Jean-Baptiste Alayrac Jeff Donahue Pauline Luc Antoine Miech Iain Barr Yana Hasson Karel Lenc Arthur Mensch Katherine Millican Malcolm Reynolds et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems 35 (2022) 23716\u201323736."},{"key":"e_1_2_1_3_1","volume-title":"Active perception","author":"Aloimonos Yiannis","unstructured":"Yiannis Aloimonos. 2013. Active perception. Psychology Press."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.5555\/324493.324551"},{"key":"e_1_2_1_5_1","volume-title":"Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens. arXiv preprint arXiv:2404.03413","author":"Ataallah Kirolos","year":"2024","unstructured":"Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Deyao Zhu, Jian Ding, and Mohamed Elhoseiny. 2024. Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens. arXiv preprint arXiv:2404.03413 (2024)."},{"key":"e_1_2_1_6_1","volume-title":"Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966","author":"Bai Jinze","year":"2023","unstructured":"Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966 (2023)."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/5.5968"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10514-017-9615-3"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3290605.3300651"},{"key":"e_1_2_1_10_1","unstructured":"Jean-Yves Bouguet et al. 2001. Pyramidal implementation of the affine lucas kanade feature tracker description of the algorithm. Intel corporation 5 1\u201310 (2001) 4."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/2689746.2689748"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/3463509"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2023.3274575"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01742"},{"key":"e_1_2_1_15_1","volume-title":"European Conference on Computer Vision. 19\u201335","author":"Chen Liang","year":"2025","unstructured":"Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. 2025. An image is worth 1\/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In European Conference on Computer Vision. 19\u201335."},{"key":"e_1_2_1_16_1","unstructured":"Zhe Chen Weiyun Wang Yue Cao Yangzhou Liu Zhangwei Gao Erfei Cui Jinguo Zhu Shenglong Ye Hao Tian Zhaoyang Liu et al. 2024. Expanding performance boundaries of open-source multimodal models with model data and test-time scaling. arXiv preprint arXiv:2412.05271 (2024)."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11432-024-4231-5"},{"key":"e_1_2_1_18_1","unstructured":"Xiangxiang Chu Limeng Qiao Xinyu Zhang Shuang Xu Fei Wei Yang Yang Xiaofei Sun Yiming Hu Xinyang Lin Bo Zhang et al. 2024. Mobilevlm v2: Faster and stronger baseline for vision language model. 
arXiv preprint arXiv:2402.03766 (2024)."},{"key":"e_1_2_1_19_1","unstructured":"Alibaba Cloud. 2023. Intelligent Speech Interaction for Human-Computer Interaction - Alibaba Cloud \u2014 alibabacloud.com. https:\/\/www.alibabacloud.com\/product\/intelligent-speech-interaction. [Accessed 10-08-2023]."},{"key":"e_1_2_1_20_1","volume-title":"Instructblip: Towards general-purpose vision-language models with instruction tuning. arxiv","author":"Dai Wenliang","year":"2023","unstructured":"Wenliang Dai, Junnan Li, D Li, AMH Tiong, J Zhao, W Wang, B Li, P Fung, and S Hoi. 2023. Instructblip: Towards general-purpose vision-language models with instruction tuning. arxiv 2023. arXiv preprint arXiv:2305.06500 2 (2023)."},{"key":"e_1_2_1_21_1","volume-title":"Nonlocal sparse and low-rank regularization for optical flow estimation","author":"Dong Weisheng","year":"2014","unstructured":"Weisheng Dong, Guangming Shi, Xiaocheng Hu, and Yi Ma. 2014. Nonlocal sparse and low-rank regularization for optical flow estimation. IEEE transactions on image processing 23, 10 (2014), 4527\u20134538."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-72670-5_5"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3672539.3686776"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00630"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.213"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3409120.3410651"},{"key":"e_1_2_1_27_1","volume-title":"Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision. 6941\u20136950","author":"Gothe Sourabh Vasant","year":"2024","unstructured":"Sourabh Vasant Gothe, Vibhav Agarwal, Sourav Ghosh, Jayesh Rajkumar Vachhani, Pranay Kashyap, and Barath Raj Kandur Raja. 2024. What's in the Flow? Exploiting Temporal Motion Cues for Unsupervised Generic Event Boundary Detection. In Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision. 6941\u20136950."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01842"},{"key":"e_1_2_1_29_1","volume-title":"2024 IEEE\/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 129\u2013134","author":"Hao Yu","year":"2024","unstructured":"Yu Hao, Alexey Magay, Hao Huang, Shuaihang Yuan, Congcong Wen, and Yi Fang. 2024. ChatMap: A Wearable Platform Based on the Multi-modal Foundation Model to Augment Spatial Cognition for People with Blindness and Low Vision. In 2024 IEEE\/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 129\u2013134."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_2_1_31_1","volume-title":"LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model. arXiv preprint arXiv:2404.01331","author":"Hinck Musashi","year":"2024","unstructured":"Musashi Hinck, Matthew L Olson, David Cobbley, Shao-Yen Tseng, and Vasudev Lal. 2024. LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model. 
arXiv preprint arXiv:2404.01331 (2024)."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/2632048.2632079"},{"key":"e_1_2_1_33_1","doi-asserted-by":"crossref","first-page":"e2115302119","DOI":"10.1073\/pnas.2115302119","article-title":"Texture-like representation of objects in human visual cortex","volume":"119","author":"Jagadeesh Akshay V","year":"2022","unstructured":"Akshay V Jagadeesh and Justin L Gardner. 2022. Texture-like representation of objects in human visual cortex. Proceedings of the National Academy of Sciences 119, 17 (2022), e2115302119.","journal-title":"Proceedings of the National Academy of Sciences"},{"key":"e_1_2_1_34_1","volume-title":"Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization. In International Conference on Machine Learning. 22185\u201322209","author":"Jin Yang","year":"2024","unstructured":"Yang Jin, Zhicheng Sun, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, Kun Gai, and Yadong Mu. 2024. Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization. In International Conference on Machine Learning. 22185\u201322209."},{"key":"e_1_2_1_35_1","volume-title":"Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization. In International Conference on Learning Representations.","author":"Jin Yang","year":"2024","unstructured":"Yang Jin, Kun Xu, Kun Xu, Liwei Chen, Chao Liao, Jianchao Tan, Yadong Mu, et al. 2024. Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization. In International Conference on Learning Representations."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3240167.3240174"},{"key":"e_1_2_1_37_1","volume-title":"Gazegpt: Augmenting human capabilities using gaze-contingent contextual ai for smart eyewear. arXiv preprint arXiv:2401.17217","author":"Konrad Robert","year":"2024","unstructured":"Robert Konrad, Nitish Padmanaban, J Gabriel Buckmaster, Kevin C Boyle, and Gordon Wetzstein. 2024. Gazegpt: Augmenting human capabilities using gaze-contingent contextual ai for smart eyewear. arXiv preprint arXiv:2401.17217 (2024)."},{"key":"e_1_2_1_38_1","volume-title":"Proceedings of the 13th annual ACM international conference on Multimedia. 229\u2013238","author":"Kulkarni Purushottam","year":"2005","unstructured":"Purushottam Kulkarni, Deepak Ganesan, Prashant Shenoy, and Qifeng Lu. 2005. SensEye: a multi-tier camera sensor network. In Proceedings of the 13th annual ACM international conference on Multimedia. 229\u2013238."},{"key":"e_1_2_1_39_1","volume-title":"Sigchi Conference on Human Factors in Computing Systems.","author":"Law Lai Chong","year":"2009","unstructured":"Lai Chong Law, Virpi Roto, Marc Hassenzahl, Arnold P. O. S. Vermeeren, and Joke Kort. 2009. Understanding, scoping and defining user experience: a survey approach. In Sigchi Conference on Human Factors in Computing Systems."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/3605390.3605400"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1080\/10447318.2018.1455307"},{"key":"e_1_2_1_42_1","article-title":"Item benchmarks for the system usability scale","volume":"13","author":"Lewis James R","year":"2018","unstructured":"James R Lewis and Jeff Sauro. 2018. Item benchmarks for the system usability scale. 
Journal of Usability studies 13, 3 (2018).","journal-title":"Journal of Usability studies"},{"key":"e_1_2_1_43_1","volume-title":"Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326","author":"Li Bo","year":"2024","unstructured":"Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. 2024. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024)."},{"key":"e_1_2_1_44_1","volume-title":"International conference on machine learning. PMLR","author":"Li Junnan","year":"2023","unstructured":"Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning. PMLR, 19730\u201319742."},{"key":"e_1_2_1_45_1","volume-title":"Proceeding of the 11th annual international conference on Mobile systems, applications, and services. 69\u201382.","author":"LiKamWa Robert","unstructured":"Robert LiKamWa, Bodhi Priyantha, Matthai Philipose, Lin Zhong, and Paramvir Bahl. 2013. Energy characterization and optimization of image sensing toward continuous mobile vision. In Proceeding of the 11th annual international conference on Mobile systems, applications, and services. 69\u201382."},{"key":"e_1_2_1_46_1","volume-title":"Moe-llava: Mixture of experts for large vision-language models. arXiv preprint arXiv:2401.15947","author":"Lin Bin","year":"2024","unstructured":"Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, and Li Yuan. 2024. Moe-llava: Mixture of experts for large vision-language models. arXiv preprint arXiv:2401.15947 (2024)."},{"key":"e_1_2_1_47_1","first-page":"7575","article-title":"Egocentric video-language pretraining","volume":"35","author":"Lin Kevin Qinghong","year":"2022","unstructured":"Kevin Qinghong Lin, Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Z Xu, Difei Gao, Rong-Cheng Tu, Wenzhe Zhao, Weijie Kong, et al. 2022. Egocentric video-language pretraining. Advances in Neural Information Processing Systems 35 (2022), 7575\u20137586.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.02484"},{"key":"e_1_2_1_49_1","volume-title":"2019 Data Compression Conference (DCC). IEEE, 478\u2013487","author":"Lubana Ekdeep Singh","year":"2019","unstructured":"Ekdeep Singh Lubana, Vinayak Aggarwal, and Robert P Dick. 2019. Machine Foveation: An application-aware compressive sensing framework. In 2019 Data Compression Conference (DCC). IEEE, 478\u2013487."},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2018.2858340"},{"key":"e_1_2_1_51_1","volume-title":"Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424","author":"Maaz Muhammad","year":"2023","unstructured":"Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. 2023. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424 (2023)."},{"key":"e_1_2_1_52_1","first-page":"46212","article-title":"Egoschema: A diagnostic benchmark for very long-form video language understanding","volume":"36","author":"Mangalam Karttikeya","year":"2023","unstructured":"Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. 2023. 
Egoschema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems 36 (2023), 46212\u201346244.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_53_1","volume-title":"Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM","year":"1981","unstructured":"Martin, A., Fischler, Robert, C., and Bolles. 1981. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM (1981)."},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01257"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-662-48051-9_17"},{"key":"e_1_2_1_56_1","volume-title":"Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748","author":"van den Oord Aaron","year":"2018","unstructured":"Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)."},{"key":"e_1_2_1_57_1","volume-title":"Multimodal open-vocabulary video classification via pre-trained vision and language models. arXiv preprint arXiv:2207.07646","author":"Qian Rui","year":"2022","unstructured":"Rui Qian, Yeqing Li, Zheng Xu, Ming-Hsuan Yang, Serge Belongie, and Yin Cui. 2022. Multimodal open-vocabulary video classification via pre-trained vision and language models. arXiv preprint arXiv:2207.07646 (2022)."},{"key":"e_1_2_1_58_1","volume-title":"International conference on machine learning. PMLR, 8748\u20138763","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748\u20138763."},{"key":"e_1_2_1_59_1","volume-title":"International conference on machine learning. PMLR, 28492\u201328518","author":"Radford Alec","year":"2023","unstructured":"Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In International conference on machine learning. PMLR, 28492\u201328518."},{"key":"e_1_2_1_60_1","volume-title":"Yong Jae Lee, and Yan Yan","author":"Shang Yuzhang","year":"2024","unstructured":"Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. 2024. Llava-prumerge: Adaptive token reduction for efficient large multimodal models. arXiv preprint arXiv:2403.15388 (2024)."},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00081"},{"key":"e_1_2_1_62_1","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision. 8075\u20138084","author":"Shou Mike Zheng","year":"2021","unstructured":"Mike Zheng Shou, Stan Weixian Lei, Weiyao Wang, Deepti Ghadiyaram, and Matt Feiszli. 2021. Generic event boundary detection: A benchmark for event segmentation. In Proceedings of the IEEE\/CVF international conference on computer vision. 8075\u20138084."},{"key":"e_1_2_1_63_1","volume-title":"Two-stream convolutional networks for action recognition in videos. Advances in neural information processing systems 27","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. 
Two-stream convolutional networks for action recognition in videos. Advances in neural information processing systems 27 (2014)."},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58536-5_24"},{"key":"e_1_2_1_65_1","doi-asserted-by":"crossref","first-page":"1036","DOI":"10.1038\/s41433-023-02842-z","article-title":"Meta smart glasses\u2014large language models and the future for assistive glasses for individuals with vision impairments","volume":"38","author":"Waisberg Ethan","year":"2024","unstructured":"Ethan Waisberg, Joshua Ong, Mouayad Masalkhi, Nasif Zaman, Prithul Sarker, Andrew G Lee, and Alireza Tavakkoli. 2024. Meta smart glasses\u2014large language models and the future for assistive glasses for individuals with vision impairments. Eye 38, 6 (2024), 1036\u20131038.","journal-title":"Eye"},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46484-8_2"},{"key":"e_1_2_1_67_1","volume-title":"Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079","author":"Wang Weihan","year":"2023","unstructured":"Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. 2023. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079 (2023)."},{"key":"e_1_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-72989-8_4"},{"key":"e_1_2_1_69_1","first-page":"417","article-title":"A variable resolution feedback improving the performances of object detection and recognition","volume":"232","author":"Wang Zihan","year":"2018","unstructured":"Zihan Wang, Qun Hao, Fanghua Zhang, Yao Hu, and Jie Cao. 2018. A variable resolution feedback improving the performances of object detection and recognition. Proceedings of the Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering 232, 4 (2018), 417\u2013427.","journal-title":"Proceedings of the Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering"},{"key":"e_1_2_1_70_1","volume-title":"VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos. arXiv preprint arXiv:2405.19209","author":"Wang Ziyang","year":"2024","unstructured":"Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, and Mohit Bansal. 2024. VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos. arXiv preprint arXiv:2405.19209 (2024)."},{"key":"e_1_2_1_71_1","volume-title":"MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding. arXiv preprint arXiv:2411.17762","author":"Xie Rongchang","year":"2024","unstructured":"Rongchang Xie, Chen Du, Ping Song, and Chang Liu. 2024. MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding. arXiv preprint arXiv:2411.17762 (2024)."},{"key":"e_1_2_1_72_1","volume-title":"See Kiong Ng, and Jiashi Feng","author":"Xu Lin","year":"2024","unstructured":"Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, and Jiashi Feng. 2024. Pllava: Parameter-free llava extension from images to videos for video dense captioning. arXiv preprint arXiv:2404.16994 (2024)."},{"key":"e_1_2_1_73_1","volume-title":"Slowfast-llava: A strong training-free baseline for video large language models. arXiv preprint arXiv:2407.15841","author":"Xu Mingze","year":"2024","unstructured":"Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, and Afshin Dehghan. 2024. 
Slowfast-llava: A strong training-free baseline for video large language models. arXiv preprint arXiv:2407.15841 (2024)."},{"key":"e_1_2_1_74_1","doi-asserted-by":"publisher","DOI":"10.1145\/3659600"},{"key":"e_1_2_1_75_1","volume-title":"DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models. arXiv preprint arXiv:2405.20985","author":"Yao Linli","year":"2024","unstructured":"Linli Yao, Lei Li, Shuhuai Ren, Lean Wang, Yuanxin Liu, Xu Sun, and Lu Hou. 2024. DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models. arXiv preprint arXiv:2405.20985 (2024)."},{"key":"e_1_2_1_76_1","unstructured":"Qinghao Ye Haiyang Xu Guohai Xu Jiabo Ye Ming Yan Yiyang Zhou Junyang Wang Anwen Hu Pengcheng Shi Yaya Shi et al. 2023. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178 (2023)."},{"key":"e_1_2_1_77_1","volume-title":"Fit and prune: Fast and training-free visual token pruning for multi-modal large language models. arXiv preprint arXiv:2409.10197","author":"Ye Weihao","year":"2024","unstructured":"Weihao Ye, Qiong Wu, Wenhao Lin, and Yiyi Zhou. 2024. Fit and prune: Fast and training-free visual token pruning for multi-modal large language models. arXiv preprint arXiv:2409.10197 (2024)."},{"key":"e_1_2_1_78_1","volume-title":"A survey on multimodal large language models. arXiv preprint arXiv:2306.13549","author":"Yin Shukang","year":"2023","unstructured":"Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. 2023. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549 (2023)."},{"key":"e_1_2_1_79_1","volume-title":"Balancing Performance and Efficiency: A Multimodal Large Language Model Pruning Method based Image Text Interaction. arXiv preprint arXiv:2409.01162","author":"Yu Gaotong","year":"2024","unstructured":"Gaotong Yu, Yi Chen, and Jian Xu. 2024. Balancing Performance and Efficiency: A Multimodal Large Language Model Pruning Method based Image Text Interaction. arXiv preprint arXiv:2409.01162 (2024)."},{"key":"e_1_2_1_80_1","volume-title":"Tinygpt-v: Efficient multimodal large language model via small backbones. arXiv preprint arXiv:2312.16862","author":"Yuan Zhengqing","year":"2023","unstructured":"Zhengqing Yuan, Zhaoxu Li, and Lichao Sun. 2023. Tinygpt-v: Efficient multimodal large language model via small backbones. arXiv preprint arXiv:2312.16862 (2023)."},{"key":"e_1_2_1_81_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.01100"},{"key":"e_1_2_1_82_1","volume-title":"Empowering Smart Glasses with Large Language Models: Towards Ubiquitous AGI. In Companion of the 2024 on ACM International Joint Conference on Pervasive and Ubiquitous Computing. 631\u2013633","author":"Zhang Dell","year":"2024","unstructured":"Dell Zhang, Yongxiang Li, Zhongjiang He, and Xuelong Li. 2024. Empowering Smart Glasses with Large Language Models: Towards Ubiquitous AGI. In Companion of the 2024 on ACM International Joint Conference on Pervasive and Ubiquitous Computing. 631\u2013633."},{"key":"e_1_2_1_83_1","volume-title":"Make VLM Inference Faster. arXiv preprint arXiv:2412.01818","author":"Zhang Qizhe","year":"2024","unstructured":"Qizhe Zhang, Aosong Cheng, Ming Lu, Zhiyong Zhuo, Minqi Wang, Jiajun Cao, Shaobo Guo, Qi She, and Shanghang Zhang. 2024. [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster. 
arXiv preprint arXiv:2412.01818 (2024)."},{"key":"e_1_2_1_84_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3517250","article-title":"Do smart glasses dream of sentimental visions? Deep emotionship analysis for eyewear devices","volume":"6","author":"Zhao Yingying","year":"2022","unstructured":"Yingying Zhao, Yuhu Chang, Yutian Lu, Yujiang Wang, Mingzhi Dong, Qin Lv, Robert P Dick, Fan Yang, Tun Lu, Ning Gu, et al. 2022. Do smart glasses dream of sentimental visions? Deep emotionship analysis for eyewear devices. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 6, 1 (2022), 1\u201329.","journal-title":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies"},{"key":"e_1_2_1_85_1","doi-asserted-by":"crossref","first-page":"2150","DOI":"10.1109\/TMM.2021.3076612","article-title":"A reinforcement-learning-based energy-efficient framework for multi-task video analytics pipeline","volume":"24","author":"Zhao Yingying","year":"2021","unstructured":"Yingying Zhao, Mingzhi Dong, Yujiang Wang, Da Feng, Qin Lv, Robert P Dick, Dongsheng Li, Tun Lu, Ning Gu, and Li Shang. 2021. A reinforcement-learning-based energy-efficient framework for multi-task video analytics pipeline. IEEE Transactions on Multimedia 24 (2021), 2150\u20132163.","journal-title":"IEEE Transactions on Multimedia"},{"key":"e_1_2_1_86_1","volume-title":"Bubogpt: Enabling visual grounding in multi-modal llms. arXiv preprint arXiv:2307.08581","author":"Zhao Yang","year":"2023","unstructured":"Yang Zhao, Zhijie Lin, Daquan Zhou, Zilong Huang, Jiashi Feng, and Bingyi Kang. 2023. Bubogpt: Enabling visual grounding in multi-modal llms. arXiv preprint arXiv:2307.08581 (2023)."},{"key":"e_1_2_1_87_1","volume-title":"Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592","author":"Zhu Deyao","year":"2023","unstructured":"Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)."},{"key":"e_1_2_1_88_1","volume-title":"LLaVA-phi: Efficient Multi-Modal Assistant with Small Language Model. arXiv preprint arXiv:2401.02330","author":"Zhu Yichen","year":"2024","unstructured":"Yichen Zhu, Minjie Zhu, Ning Liu, Zhicai Ou, Xiaofeng Mou, and Jian Tang. 2024. LLaVA-phi: Efficient Multi-Modal Assistant with Small Language Model. 
arXiv preprint arXiv:2401.02330 (2024)."}],"container-title":["Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3770641","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,2]],"date-time":"2025-12-02T19:52:47Z","timestamp":1764705167000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3770641"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,2]]},"references-count":88,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2025,12,2]]}},"alternative-id":["10.1145\/3770641"],"URL":"https:\/\/doi.org\/10.1145\/3770641","relation":{},"ISSN":["2474-9567"],"issn-type":[{"value":"2474-9567","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,12,2]]},"assertion":[{"value":"2025-12-02","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
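This record is a response from the Crossref REST API, so the underlying JSON can be fetched directly. Below is a minimal sketch using only the Python standard library; the endpoint pattern and the field names follow Crossref's public API as reflected in the record above.

# Fetch this article's Crossref record and print a few of its fields.
import json
import urllib.request

DOI = "10.1145/3770641"
url = f"https://api.crossref.org/works/{DOI}"

with urllib.request.urlopen(url) as resp:
    record = json.load(resp)["message"]  # payload lives under "message"

print(record["title"][0])            # article title
print(record["container-title"][0])  # journal title
print("DOI:", record["DOI"], "| references:", record["reference-count"])
for author in record["author"]:
    print(" ", author["given"], author["family"])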