{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,9]],"date-time":"2025-12-09T19:56:08Z","timestamp":1765310168191,"version":"3.46.0"},"publisher-location":"New York, NY, USA","reference-count":32,"publisher":"ACM","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,10,27]]},"DOI":"10.1145\/3746027.3755203","type":"proceedings-article","created":{"date-parts":[[2025,10,25]],"date-time":"2025-10-25T07:26:38Z","timestamp":1761377198000},"page":"3808-3816","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["<scp>CapRecover<\/scp>: A Cross-Modality Feature Inversion Attack Framework on Vision Language Models"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0007-6409-9168","authenticated-orcid":false,"given":"Kedong","family":"Xiu","sequence":"first","affiliation":[{"name":"New York University, New York, New York, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4815-9235","authenticated-orcid":false,"given":"Sai Qian","family":"Zhang","sequence":"additional","affiliation":[{"name":"New York University, New York, New York, USA"}]}],"member":"320","published-online":{"date-parts":[[2025,10,27]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"Stability AI. [n.d.]. Activating humanity's potential through generative AI. https:\/\/stability.ai\/."},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_2_1_3_1","volume-title":"An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929","author":"Dosovitskiy Alexey","year":"2020","unstructured":"Alexey Dosovitskiy. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. 
arXiv preprint arXiv:2010.11929 (2020)."},{"key":"e_1_3_2_1_4_1","unstructured":"Yichen Gong Delong Ran Jinyuan Liu Conglei Wang Tianshuo Cong Anyu Wang Sisi Duan and Xiaoyun Wang. 2023. FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts. arXiv:2311.05608 [cs.CR]"},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_1_6_1","volume-title":"Large language models (LLMs) inference offloading and resource allocation in cloud-edge computing: An active inference approach","author":"He Ying","year":"2024","unstructured":"Ying He, Jingcheng Fang, F Richard Yu, and Victor C Leung. 2024. Large language models (LLMs) inference offloading and resource allocation in cloud-edge computing: An active inference approach. IEEE Transactions on Mobile Computing (2024)."},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/3359789.3359824"},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1613\/jair.3994"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00140"},{"key":"e_1_3_2_1_10_1","volume-title":"Unsupervised prompt learning for vision-language models. arXiv preprint arXiv:2204.03649","author":"Huang Tony","year":"2022","unstructured":"Tony Huang, Jack Chu, and Fangyun Wei. 2022. Unsupervised prompt learning for vision-language models. arXiv preprint arXiv:2204.03649 (2022)."},{"key":"e_1_3_2_1_11_1","volume-title":"Image captions are natural prompts for text-to-image models. arXiv preprint arXiv:2307.08526","author":"Lei Shiye","year":"2023","unstructured":"Shiye Lei, Hao Chen, Sen Zhang, Bo Zhao, and Dacheng Tao. 2023. Image captions are natural prompts for text-to-image models. arXiv preprint arXiv:2307.08526 (2023)."},{"key":"e_1_3_2_1_12_1","volume-title":"International conference on machine learning. PMLR","author":"Li Junnan","year":"2023","unstructured":"Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023b. 
Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning. PMLR, 19730-19742."},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.01549"},{"key":"e_1_3_2_1_14_1","first-page":"12934","article-title":"Efficientformer: Vision transformers at mobilenet speed","volume":"35","author":"Li Yanyu","year":"2022","unstructured":"Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, and Jian Ren. 2022. Efficientformer: Vision transformers at mobilenet speed. Advances in Neural Information Processing Systems, Vol. 35 (2022), 12934-12949.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_15_1","first-page":"740","volume-title":"Zurich","author":"Lin Tsung-Yi","year":"2014","unstructured":"Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll\u00e1r, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Computer Vision-ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 740-755."},{"key":"e_1_3_2_1_16_1","unstructured":"Haotian Liu Chunyuan Li Qingyang Wu and Yong Jae Lee. 2023. Visual Instruction Tuning."},{"key":"e_1_3_2_1_17_1","volume-title":"ensemble, and cooperate! a survey on collaborative strategies in the era of large language models. arXiv preprint arXiv:2407.06089","author":"Lu Jinliang","year":"2024","unstructured":"Jinliang Lu, Ziliang Pang, Min Xiao, Yaochen Zhu, Rui Xia, and Jiajun Zhang. 2024. Merge, ensemble, and cooperate! a survey on collaborative strategies in the era of large language models. arXiv preprint arXiv:2407.06089 (2024)."},{"key":"e_1_3_2_1_18_1","unstructured":"Weidi Luo Siyuan Ma Xiaogeng Liu Xiaoyu Guo and Chaowei Xiao. 2024. 
JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks. arXiv:2404.03027 [cs.CR]"},{"key":"e_1_3_2_1_19_1","volume-title":"Splitllm: Collaborative inference of llms for model placement and throughput optimization. arXiv preprint arXiv:2410.10759","author":"Mudvari Akrit","year":"2024","unstructured":"Akrit Mudvari, Yuang Jiang, and Leandros Tassiulas. 2024. Splitllm: Collaborative inference of llms for model placement and throughput optimization. arXiv preprint arXiv:2410.10759 (2024)."},{"key":"e_1_3_2_1_20_1","volume-title":"Gabriel Ilharco, Sewoong Oh, and Ludwig Schmidt.","author":"Nguyen Thao","year":"2023","unstructured":"Thao Nguyen, Samir Yitzhak Gadre, Gabriel Ilharco, Sewoong Oh, and Ludwig Schmidt. 2023. Improving multimodal datasets with image captioning. Advances in neural information processing systems, Vol. 36 (2023), 22047-22069."},{"key":"e_1_3_2_1_21_1","unstructured":"OpenAI. 2024. Hello GPT-4o. https:\/\/openai.com\/index\/hello-gpt-4o\/."},{"key":"e_1_3_2_1_22_1","volume-title":"International conference on machine learning. PMLR, 8748-8763","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748-8763."},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00474"},{"key":"e_1_3_2_1_24_1","volume-title":"Plug and Pray: Exploiting off-the-shelf components of Multi-Modal Models. arXiv preprint arXiv:2307.14539","author":"Shayegani Erfan","year":"2023","unstructured":"Erfan Shayegani, Yue Dong, and Nael Abu-Ghazaleh. 2023. Plug and Pray: Exploiting off-the-shelf components of Multi-Modal Models. 
arXiv preprint arXiv:2307.14539 (2023)."},{"key":"e_1_3_2_1_25_1","volume-title":"Prompt Stealing Attacks Against Text-to-Image Generation Models. In USENIX Security Symposium (USENIX Security). USENIX.","author":"Shen Xinyue","year":"2024","unstructured":"Xinyue Shen, Yiting Qu, Michael Backes, and Yang Zhang. 2024. Prompt Stealing Attacks Against Text-to-Image Generation Models. In USENIX Security Symposium (USENIX Security). USENIX."},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00532"},{"key":"e_1_3_2_1_27_1","unstructured":"Wikipedia. [n.d.]. Homomorphic encryption. https:\/\/en.wikipedia.org\/wiki\/Homomorphic_encryption."},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01153"},{"key":"e_1_3_2_1_29_1","volume-title":"Secure Federated XGBoost with CUDA-accelerated Homomorphic Encryption via NVIDIA FLARE. arXiv preprint arXiv:2504.03909","author":"Xu Ziyue","year":"2025","unstructured":"Ziyue Xu, Yuan-Ting Hsieh, Zhihong Zhang, Holger R Roth, Chester Chen, Yan Cheng, and Andrew Feng. 2025. Secure Federated XGBoost with CUDA-accelerated Homomorphic Encryption via NVIDIA FLARE. arXiv preprint arXiv:2504.03909 (2025)."},{"key":"e_1_3_2_1_30_1","unstructured":"An Yang Baosong Yang Beichen Zhang Binyuan Hui Bo Zheng Bowen Yu Chengyuan Li Dayiheng Liu Fei Huang Haoran Wei et al. 2024. Qwen2. 5 technical report. arXiv preprint arXiv:2412.15115 (2024)."},{"key":"e_1_3_2_1_31_1","volume-title":"Edgeshard: Efficient llm inference via collaborative edge computing","author":"Zhang Mingjin","year":"2024","unstructured":"Mingjin Zhang, Xiaoming Shen, Jiannong Cao, Zeyang Cui, and Shan Jiang. 2024. Edgeshard: Efficient llm inference via collaborative edge computing. 
IEEE Internet of Things Journal (2024)."},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.14722\/ndss.2025.230030"}],"event":{"name":"MM '25: The 33rd ACM International Conference on Multimedia","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Dublin Ireland","acronym":"MM '25"},"container-title":["Proceedings of the 33rd ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3746027.3755203","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,9]],"date-time":"2025-12-09T19:52:46Z","timestamp":1765309966000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3746027.3755203"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,27]]},"references-count":32,"alternative-id":["10.1145\/3746027.3755203","10.1145\/3746027"],"URL":"https:\/\/doi.org\/10.1145\/3746027.3755203","relation":{},"subject":[],"published":{"date-parts":[[2025,10,27]]},"assertion":[{"value":"2025-10-27","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}