{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,19]],"date-time":"2026-03-19T06:29:54Z","timestamp":1773901794063,"version":"3.50.1"},"reference-count":89,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2024,11,21]],"date-time":"2024-11-21T00:00:00Z","timestamp":1732147200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Interact. Mob. Wearable Ubiquitous Technol."],"published-print":{"date-parts":[[2024,11,21]]},"abstract":"<jats:p>Cameras are ubiquitous in society, and users increasingly look to them to extract insights about the physical world. Current human-to-camera interaction methods, while advanced, do not yet support the intuitive, conversational interaction one would expect in human-to-human communication. To achieve a more natural interaction between humans and cameras, we propose a novel contextual chatting-to-camera paradigm. This paradigm allows users to interact with the camera in natural language, including expressing interests and asking questions. In response, the camera can customize specific tasks tailored to these interests and attempt to answer the questions asked. We designed ChatCam, which embraces LLMs for contextual chatting-to-camera with interest-oriented video summarization. With a novel actor-critic LLM prompting approach, ChatCam can understand users' interests and translate them into concrete tasks and target objects. ChatCam can also customize relevant models on the resource-constrained edge, with the help of a multi-modal large language model and deep reinforcement learning, while maintaining high accuracy. 
Results show that ChatCam achieves improvements of up to 43.9% in understanding user interests and 21.1% in model accuracy compared with state-of-the-art methods across multiple settings. Various examples and a user study also demonstrate the effectiveness of ChatCam in practice.<\/jats:p>","DOI":"10.1145\/3699731","type":"journal-article","created":{"date-parts":[[2024,11,21]],"date-time":"2024-11-21T12:23:32Z","timestamp":1732191812000},"page":"1-34","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":6,"title":["ChatCam: Embracing LLMs for Contextual Chatting-to-Camera with Interest-Oriented Video Summarization"],"prefix":"10.1145","volume":"8","author":[{"ORCID":"https:\/\/orcid.org\/0009-0006-7279-7997","authenticated-orcid":false,"given":"Kaijie","family":"Xiao","sequence":"first","affiliation":[{"name":"College of Computer Science, Zhejiang University, Hangzhou, Zhejiang, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7897-5965","authenticated-orcid":false,"given":"Yi","family":"Gao","sequence":"additional","affiliation":[{"name":"College of Computer Science, Zhejiang University, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-6561-4285","authenticated-orcid":false,"given":"Fu","family":"Li","sequence":"additional","affiliation":[{"name":"College of Computer Science, Zhejiang University, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-6519-2335","authenticated-orcid":false,"given":"Weifeng","family":"Xu","sequence":"additional","affiliation":[{"name":"School of Software Technology, Zhejiang University, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-5449-0136","authenticated-orcid":false,"given":"Pengzhi","family":"Chen","sequence":"additional","affiliation":[{"name":"School of Software Technology, Zhejiang University, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0498-1494","authenticated-orcid":false,"given":"Wei","family":"Dong","sequence":"additional","affiliation":[{"name":"College of 
Computer Science, Zhejiang University, China"}]}],"member":"320","published-online":{"date-parts":[[2024,11,21]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"2021. all-mpnet-base-v2. https:\/\/huggingface.co\/sentence-transformers\/all-mpnet-base-v2."},{"key":"e_1_2_1_2_1","unstructured":"2023. Chroma. https:\/\/github.com\/chroma-core\/chroma."},{"key":"e_1_2_1_3_1","unstructured":"2023. Evaluating the ideal chunk size for a RAG system using LLaMaindex. https:\/\/www.llamaindex.ai."},{"key":"e_1_2_1_4_1","unstructured":"2023. Gemini. https:\/\/deepmind.google\/technologies\/gemini."},{"key":"e_1_2_1_5_1","unstructured":"2023. GPT-4V(ison). https:\/\/openai.com\/research\/gpt-4v-system-card."},{"key":"e_1_2_1_6_1","unstructured":"2023. Jetson Xavier NX. https:\/\/www.nvidia.cn\/autonomous-machines\/embedded-systems\/jetson-xavier-nx\/."},{"key":"e_1_2_1_7_1","unstructured":"2023. Qdrant. https:\/\/github.com\/qdrant\/qdrant."},{"key":"e_1_2_1_8_1","unstructured":"2023. Recursively split by character. https:\/\/python.langchain.com\/docs."},{"key":"e_1_2_1_9_1","volume-title":"Llm-deliberation: Evaluating llms with interactive multi-agent negotiation games. arXiv preprint arXiv:2309.17234","author":"Abdelnabi Sahar","year":"2023","unstructured":"Sahar Abdelnabi, Amr Gomaa, Sarath Sivaprasad, Lea Sch\u00f6nherr, and Mario Fritz. 2023. Llm-deliberation: Evaluating llms with interactive multi-agent negotiation games. arXiv preprint arXiv:2309.17234 (2023)."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICACCS48705.2020.9074315"},{"key":"e_1_2_1_11_1","volume-title":"Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940","author":"Bello Irwan","year":"2016","unstructured":"Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. 2016. Neural combinatorial optimization with reinforcement learning. 
arXiv preprint arXiv:1611.09940 (2016)."},{"key":"e_1_2_1_12_1","volume-title":"19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22)","author":"Bhardwaj Romil","year":"2022","unstructured":"Romil Bhardwaj, Zhengxu Xia, Ganesh Ananthanarayanan, Junchen Jiang, Yuanchao Shu, Nikolaos Karianakis, Kevin Hsieh, Paramvir Bahl, and Ion Stoica. 2022. Ekya: Continuous learning of video analytics models on edge compute servers. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). 119--135."},{"key":"e_1_2_1_13_1","first-page":"406","article-title":"Scaling video analytics on constrained edge nodes","volume":"1","author":"Canel Christopher","year":"2019","unstructured":"Christopher Canel, Thomas Kim, Giulio Zhou, Conglong Li, Hyeontaek Lim, David G Andersen, Michael Kaminsky, and Subramanya Dulloor. 2019. Scaling video analytics on constrained edge nodes. Proceedings of Machine Learning and Systems 1 (2019), 406--417.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i5.16484"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/JSEN.2016.2628099"},{"key":"e_1_2_1_16_1","volume-title":"Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems. 155--168","author":"Yu-Han Chen Tiffany","year":"2015","unstructured":"Tiffany Yu-Han Chen, Lenin Ravindranath, Shuo Deng, Paramvir Bahl, and Hari Balakrishnan. 2015. Glimpse: Continuous, real-time object recognition on mobile devices. In Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems. 155--168."},{"key":"e_1_2_1_17_1","volume-title":"International Conference on Machine Learning. PMLR, 3852--3878","author":"Cheng Ching-An","year":"2022","unstructured":"Ching-An Cheng, Tengyang Xie, Nan Jiang, and Alekh Agarwal. 2022. Adversarially trained actor critic for offline reinforcement learning. 
In International Conference on Machine Learning. PMLR, 3852--3878."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00489"},{"key":"e_1_2_1_19_1","volume-title":"Deep reinforcement learning from human preferences. Advances in neural information processing systems 30","author":"Christiano Paul F","year":"2017","unstructured":"Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. Advances in neural information processing systems 30 (2017)."},{"key":"e_1_2_1_20_1","volume-title":"Hybrid actor-critic reinforcement learning in parameterized action space. arXiv preprint arXiv:1903.01344","author":"Fan Zhou","year":"2019","unstructured":"Zhou Fan, Rui Su, Weinan Zhang, and Yong Yu. 2019. Hybrid actor-critic reinforcement learning in parameterized action space. arXiv preprint arXiv:1903.01344 (2019)."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1186\/s40537-019-0234-z"},{"key":"e_1_2_1_22_1","volume-title":"The Chronicles of RAG: The Retriever, the Chunk and the Generator. arXiv preprint arXiv:2401.07883","author":"Finardi Paulo","year":"2024","unstructured":"Paulo Finardi, Leonardo Avila, Rodrigo Castaldoni, Pedro Gengo, Celio Larcher, Marcos Piau, Pablo Costa, and Vinicius Carid\u00e1. 2024. The Chronicles of RAG: The Retriever, the Chunk and the Generator. arXiv preprint arXiv:2401.07883 (2024)."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2008.928765"},{"key":"e_1_2_1_24_1","volume-title":"Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997","author":"Gao Yunfan","year":"2023","unstructured":"Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. 2023. Retrieval-augmented generation for large language models: A survey. 
arXiv preprint arXiv:2312.10997 (2023)."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-021-01453-z"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSMCC.2012.2218595"},{"key":"e_1_2_1_27_1","volume-title":"Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921","author":"Gu Xiuye","year":"2021","unstructured":"Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. 2021. Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921 (2021)."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3264921"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.14778\/3554821.3554843"},{"key":"e_1_2_1_30_1","volume-title":"Large language model based multi-agents: A survey of progress and challenges. arXiv preprint arXiv:2402.01680","author":"Guo Taicheng","year":"2024","unstructured":"Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. 2024. Large language model based multi-agents: A survey of progress and challenges. arXiv preprint arXiv:2402.01680 (2024)."},{"key":"e_1_2_1_31_1","volume-title":"Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149","author":"Han Song","year":"2015","unstructured":"Song Han, Huizi Mao, and William J Dally. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149 (2015)."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/IST.2018.8577157"},{"key":"e_1_2_1_33_1","doi-asserted-by":"crossref","unstructured":"Wenyi Hong Weihan Wang Qingsong Lv Jiazheng Xu Wenmeng Yu Junhui Ji Yan Wang Zihan Wang Yuxiao Dong Ming Ding and Jie Tang. 2023. CogAgent: A Visual Language Model for GUI Agents. 
arXiv:2312.08914 [cs.CV]","DOI":"10.1109\/CVPR52733.2024.01354"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPR.2010.579"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00140"},{"key":"e_1_2_1_36_1","volume-title":"13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)","author":"Hsieh Kevin","year":"2018","unstructured":"Kevin Hsieh, Ganesh Ananthanarayanan, Peter Bodik, Shivaram Venkataraman, Paramvir Bahl, Matthai Philipose, Phillip B Gibbons, and Onur Mutlu. 2018. Focus: Querying large video datasets with low latency and low cost. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 269--286."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/SEC.2018.00016"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3230543.3230574"},{"key":"e_1_2_1_39_1","volume-title":"BlazeIt: optimizing declarative aggregation and limit queries for neural network-based video analytics. arXiv preprint arXiv:1805.01046","author":"Kang Daniel","year":"2018","unstructured":"Daniel Kang, Peter Bailis, and Matei Zaharia. 2018. BlazeIt: optimizing declarative aggregation and limit queries for neural network-based video analytics. arXiv preprint arXiv:1805.01046 (2018)."},{"key":"e_1_2_1_40_1","volume-title":"Noscope: optimizing neural network queries over video at scale. arXiv preprint arXiv:1703.02529","author":"Kang Daniel","year":"2017","unstructured":"Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, and Matei Zaharia. 2017. Noscope: optimizing neural network queries over video at scale. arXiv preprint arXiv:1703.02529 (2017)."},{"key":"e_1_2_1_41_1","first-page":"1936","article-title":"Sym-nco: Leveraging symmetricity for neural combinatorial optimization","volume":"35","author":"Kim Minsu","year":"2022","unstructured":"Minsu Kim, Junyoung Park, and Jinkyoo Park. 2022. 
Sym-nco: Leveraging symmetricity for neural combinatorial optimization. Advances in Neural Information Processing Systems 35 (2022), 1936--1949.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_42_1","unstructured":"Alex Krizhevsky. 2009. Learning multiple layers of features from tiny images. Technical Report."},{"key":"e_1_2_1_43_1","volume-title":"Large multimodal models: Notes on cvpr 2023 tutorial. arXiv preprint arXiv:2306.14895","author":"Chunyuan Li.","year":"2023","unstructured":"Chunyuan Li. 2023. Large multimodal models: Notes on cvpr 2023 tutorial. arXiv preprint arXiv:2306.14895 (2023)."},{"key":"e_1_2_1_44_1","first-page":"1","article-title":"FMT: A wearable camera-based object tracking memory aid for older adults","volume":"3","author":"Li Franklin Mingzhe","year":"2019","unstructured":"Franklin Mingzhe Li, Di Laura Chen, Mingming Fan, and Khai N Truong. 2019. FMT: A wearable camera-based object tracking memory aid for older adults. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 3, 3 (2019), 1--25.","journal-title":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies"},{"key":"e_1_2_1_45_1","first-page":"51991","article-title":"Camel: Communicative agents for\" mind\" exploration of large language model society","volume":"36","author":"Li Guohao","year":"2023","unstructured":"Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. Camel: Communicative agents for\" mind\" exploration of large language model society. Advances in Neural Information Processing Systems 36 (2023), 51991--52008.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_46_1","volume-title":"Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. 
arXiv preprint arXiv:2301.12597","author":"Li Junnan","year":"2023","unstructured":"Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)."},{"key":"e_1_2_1_47_1","volume-title":"Generative judge for evaluating alignment. arXiv preprint arXiv:2310.05470","author":"Li Junlong","year":"2023","unstructured":"Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, and Pengfei Liu. 2023. Generative judge for evaluating alignment. arXiv preprint arXiv:2310.05470 (2023)."},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/3387514.3405874"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISM.2015.52"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3631429"},{"key":"e_1_2_1_51_1","unstructured":"Shilong Liu Zhaoyang Zeng Tianhe Ren Feng Li Hao Zhang Jie Yang Chunyuan Li Jianwei Yang Hang Su Jun Zhu et al. 2023. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499 (2023)."},{"key":"e_1_2_1_52_1","first-page":"17703","article-title":"Merging models with fisher-weighted averaging","volume":"35","author":"Matena Michael S","year":"2022","unstructured":"Michael S Matena and Colin A Raffel. 2022. Merging models with fisher-weighted averaging. Advances in Neural Information Processing Systems 35 (2022), 17703--17716.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_53_1","unstructured":"Brendan McMahan Eider Moore Daniel Ramage Seth Hampson and Blaise Aguera y Arcas. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics. 
PMLR 1273--1282."},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01367"},{"key":"e_1_2_1_55_1","doi-asserted-by":"crossref","unstructured":"Volodymyr Mnih Koray Kavukcuoglu David Silver Andrei A Rusu Joel Veness Marc G Bellemare Alex Graves Martin Riedmiller Andreas K Fidjeland Georg Ostrovski et al. 2015. Human-level control through deep reinforcement learning. nature 518 7540 (2015) 529--533.","DOI":"10.1038\/nature14236"},{"key":"e_1_2_1_56_1","volume-title":"Advances in neural information processing systems 34","author":"Narasimhan Medhini","year":"2021","unstructured":"Medhini Narasimhan, Anna Rohrbach, and Trevor Darrell. 2021. Clip-it! language-guided video summarization. Advances in neural information processing systems 34 (2021), 13988--14000."},{"key":"e_1_2_1_57_1","volume-title":"20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23)","author":"Padmanabhan Arthi","year":"2023","unstructured":"Arthi Padmanabhan, Neil Agarwal, Anand Iyer, Ganesh Ananthanarayanan, Yuanchao Shu, Nikolaos Karianakis, Guoqing Harry Xu, and Ravi Netravali. 2023. Gemel: Model Merging for {Memory-Efficient}, {Real-Time } Video Analytics at the Edge. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). 973--994."},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW60793.2023.00035"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01236"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46493-0_32"},{"key":"e_1_2_1_61_1","volume-title":"Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems 28","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. 
Advances in neural information processing systems 28 (2015)."},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00138-009-0231-x"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46484-8_1"},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.229"},{"key":"e_1_2_1_65_1","volume-title":"arXiv preprint arXiv:2305.03053","author":"Stoica George","year":"2023","unstructured":"George Stoica, Daniel Bolya, Jakob Bjorner, Taylor Hearn, and Judy Hoffman. 2023. ZipIt! Merging Models from Different Tasks without Training. arXiv preprint arXiv:2305.03053 (2023)."},{"key":"e_1_2_1_66_1","volume-title":"Language-Guided Self-Supervised Video Summarization Using Text Semantic Matching Considering the Diversity of the Video. arXiv preprint arXiv:2405.08890","author":"Sugihara Tomoya","year":"2024","unstructured":"Tomoya Sugihara, Shuntaro Masuda, Ling Xiao, and Toshihiko Yamasaki. 2024. Language-Guided Self-Supervised Video Summarization Using Text Semantic Matching Considering the Diversity of the Video. arXiv preprint arXiv:2405.08890 (2024)."},{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v30i1.10295"},{"key":"e_1_2_1_68_1","volume-title":"Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100","author":"Wang Jianfeng","year":"2022","unstructured":"Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. 2022. Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100 (2022)."},{"key":"e_1_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3457550"},{"key":"e_1_2_1_70_1","volume-title":"Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks","author":"Wang Lin","year":"2021","unstructured":"Lin Wang and Kuk-Jin Yoon. 2021. 
Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks. IEEE transactions on pattern analysis and machine intelligence 44, 6 (2021), 3048--3068."},{"key":"e_1_2_1_71_1","doi-asserted-by":"publisher","DOI":"10.1109\/INFOCOM41043.2020.9155284"},{"key":"e_1_2_1_72_1","unstructured":"Weihan Wang Qingsong Lv Wenmeng Yu Wenyi Hong Ji Qi Yan Wang Junhui Ji Zhuoyi Yang Lei Zhao Xixuan Song Jiazheng Xu Bin Xu Juanzi Li Yuxiao Dong Ming Ding and Jie Tang. 2023. CogVLM: Visual Expert for Pretrained Language Models. arXiv:2311.03079 [cs.CV]"},{"key":"e_1_2_1_73_1","doi-asserted-by":"crossref","unstructured":"Wenhui Wang Furu Wei Li Dong Hangbo Bao Nan Yang and Ming Zhou. 2020. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. arXiv:2002.10957 [cs.CL]","DOI":"10.18653\/v1\/2021.findings-acl.188"},{"key":"e_1_2_1_74_1","volume-title":"Aakanksha Chowdhery, and Denny Zhou.","author":"Wang Xuezhi","year":"2022","unstructured":"Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171 (2022)."},{"key":"e_1_2_1_75_1","volume-title":"Chi, and Denny Zhou","author":"Wang Xuezhi","year":"2022","unstructured":"Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. 2022. Rationale-augmented ensembles in language models. arXiv preprint arXiv:2207.00747 (2022)."},{"key":"e_1_2_1_76_1","first-page":"8483","article-title":"Language models with image descriptors are strong few-shot video-language learners","volume":"35","author":"Wang Zhenhailong","year":"2022","unstructured":"Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, et al. 2022. Language models with image descriptors are strong few-shot video-language learners. 
Advances in Neural Information Processing Systems 35 (2022), 8483--8497.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_77_1","first-page":"24824","article-title":"Chain-of-thought prompting elicits reasoning in large language models","volume":"35","author":"Wei Jason","year":"2022","unstructured":"Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824--24837.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_78_1","volume-title":"International Conference on Machine Learning. PMLR, 23965--23998","author":"Wortsman Mitchell","year":"2022","unstructured":"Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. 2022. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning. PMLR, 23965--23998."},{"key":"e_1_2_1_79_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01025"},{"key":"e_1_2_1_80_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.521"},{"key":"e_1_2_1_81_1","unstructured":"Zhiheng Xi Wenxiang Chen Xin Guo Wei He Yiwen Ding Boyang Hong Ming Zhang Junzhe Wang Senjie Jin Enyu Zhou et al. 2023. The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864 (2023)."},{"key":"e_1_2_1_82_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICC.2018.8422970"},{"key":"e_1_2_1_83_1","volume-title":"Towards reasoning in large language models via multi-agent peer review collaboration. 
arXiv preprint arXiv:2311.08152","author":"Xu Zhenran","year":"2023","unstructured":"Zhenran Xu, Senbao Shi, Baotian Hu, Jindi Yu, Dongfang Li, Min Zhang, and Yuxiang Wu. 2023. Towards reasoning in large language models via multi-agent peer review collaboration. arXiv preprint arXiv:2311.08152 (2023)."},{"key":"e_1_2_1_84_1","volume-title":"React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629","author":"Yao Shunyu","year":"2022","unstructured":"Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629 (2022)."},{"key":"e_1_2_1_85_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00271"},{"key":"e_1_2_1_86_1","volume-title":"Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490","author":"Yu Weihao","year":"2023","unstructured":"Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. 2023. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490 (2023)."},{"key":"e_1_2_1_87_1","unstructured":"Lianmin Zheng Wei-Lin Chiang Ying Sheng Siyuan Zhuang Zhanghao Wu Yonghao Zhuang Zi Lin Zhuohan Li Dacheng Li Eric Xing et al. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685 (2023)."},{"key":"e_1_2_1_88_1","volume-title":"Tinyllava: A framework of small-scale large multimodal models. arXiv preprint arXiv:2402.14289","author":"Zhou Baichuan","year":"2024","unstructured":"Baichuan Zhou, Ying Hu, Xi Weng, Junlong Jia, Jie Luo, Xien Liu, Ji Wu, and Lei Huang. 2024. Tinyllava: A framework of small-scale large multimodal models. arXiv preprint arXiv:2402.14289 (2024)."},{"key":"e_1_2_1_89_1","volume-title":"Fine-tuning language models from human preferences. 
arXiv preprint arXiv:1909.08593","author":"Ziegler Daniel M","year":"2019","unstructured":"Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2019. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593 (2019)."}],"container-title":["Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3699731","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3699731","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,25]],"date-time":"2025-09-25T16:27:20Z","timestamp":1758817640000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3699731"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11,21]]},"references-count":89,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,11,21]]}},"alternative-id":["10.1145\/3699731"],"URL":"https:\/\/doi.org\/10.1145\/3699731","relation":{},"ISSN":["2474-9567"],"issn-type":[{"value":"2474-9567","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,11,21]]},"assertion":[{"value":"2024-11-21","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}