{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T15:55:12Z","timestamp":1781538912742,"version":"3.54.5"},"publisher-location":"New York, NY, USA","reference-count":51,"publisher":"ACM","license":[{"start":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T00:00:00Z","timestamp":1781481600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/legalcode"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2026,6,16]]},"DOI":"10.1145\/3805622.3810586","type":"proceedings-article","created":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T14:42:57Z","timestamp":1781534577000},"page":"1147-1156","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["SAM3-LiteText: An Anatomical Study of the SAM3 Text Encoder for Efficient Vision-Language Segmentation"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0872-2054","authenticated-orcid":false,"given":"Chengxi","family":"Zeng","sequence":"first","affiliation":[{"name":"University of Bristol, Bristol, United Kingdom"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-6102-5133","authenticated-orcid":false,"given":"Yuxuan","family":"Jiang","sequence":"additional","affiliation":[{"name":"University of Bristol, Bristol, United Kingdom"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-4202-9791","authenticated-orcid":false,"given":"Ge","family":"Gao","sequence":"additional","affiliation":[{"name":"University of Bristol, Bristol, United Kingdom"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1595-3619","authenticated-orcid":false,"given":"Shuai","family":"Wang","sequence":"additional","affiliation":[{"name":"University of Amsterdam, Amsterdam, Netherlands"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9320-7099","authenticated-orcid":false,"given":"Duolikun","family":"Danier","sequence":"additional","affiliation":[{"name":"University of Edinburgh, Edinburgh, United Kingdom"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9213-2611","authenticated-orcid":false,"given":"Bin","family":"Zhu","sequence":"additional","affiliation":[{"name":"Singapore Management University, Singapore, Singapore"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1904-8736","authenticated-orcid":false,"given":"Stevan","family":"Rudinac","sequence":"additional","affiliation":[{"name":"University of Amsterdam, Amsterdam, Netherlands"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7634-190X","authenticated-orcid":false,"given":"David","family":"Bull","sequence":"additional","affiliation":[{"name":"University of Bristol, Bristol, United Kingdom"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6623-9936","authenticated-orcid":false,"given":"Fan","family":"Zhang","sequence":"additional","affiliation":[{"name":"University of Bristol, Bristol, United Kingdom"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2026,6,15]]},"reference":[{"key":"e_1_3_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP.2016.7533003"},{"key":"e_1_3_3_1_3_2","unstructured":"Nicolas Carion Laura Gustafson Yuan-Ting Hu Shoubhik Debnath Ronghang Hu Didac Suris Chaitanya Ryali Kalyan\u00a0Vasudev Alwala Haitham Khedr Andrew Huang Jie Lei Tengyu Ma Baishan Guo Arpit Kalla Markus Marks Joseph Greer Meng Wang Peize Sun Roman R\u00e4dle Triantafyllos Afouras Effrosyni Mavroudi Katherine Xu Tsung-Han Wu Yu Zhou Liliane Momeni Rishi Hazra Shuangrui Ding Sagar Vaze Francois Porcher Feng Li Siyuan Li Aishwarya Kamath Ho\u00a0Kei Cheng Piotr Doll\u00e1r Nikhila Ravi Kate Saenko Pengchuan Zhang and Christoph Feichtenhofer. 2025. SAM 3: Segment Anything with Concepts. arxiv:https:\/\/arXiv.org\/abs\/2511.16719\u00a0[cs.CV] https:\/\/arxiv.org\/abs\/2511.16719"},{"key":"e_1_3_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"e_1_3_3_1_5_2","doi-asserted-by":"crossref","unstructured":"Elena Facco Maria d\u2019Errico Alex Rodriguez and Alessandro Laio. 2017. Estimating the Intrinsic Dimension of Datasets by a Minimal Neighborhood Information. Scientific Reports 7 1 (2017) 12140.","DOI":"10.1038\/s41598-017-11873-y"},{"key":"e_1_3_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52734.2025.01396"},{"key":"e_1_3_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-20059-5_31"},{"key":"e_1_3_3_1_8_2","volume-title":"Gemini 2.5: Multimodal Foundation Models","author":"DeepMind Google","year":"2025","unstructured":"Google DeepMind. 2025. Gemini 2.5: Multimodal Foundation Models. Technical Report. Google. Technical Report."},{"key":"e_1_3_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00550"},{"key":"e_1_3_3_1_10_2","unstructured":"Geoffrey Hinton Oriol Vinyals and Jeff Dean. 2015. Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/1503.02531 (2015)."},{"key":"e_1_3_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46448-0_7"},{"key":"e_1_3_3_1_12_2","unstructured":"Junjie Jiang Zelin Wang Manqi Zhao Yin Li and DongSheng Jiang. 2025. SAM2MOT: A Novel Paradigm of Multi-Object Tracking by Segmentation. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2504.04519 (2025)."},{"key":"e_1_3_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.findings-emnlp.372"},{"key":"e_1_3_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00180"},{"key":"e_1_3_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00371"},{"key":"e_1_3_3_1_16_2","volume-title":"Advances in Neural Information Processing Systems (NeurIPS)","author":"Levina Elizaveta","year":"2004","unstructured":"Elizaveta Levina and Peter\u00a0J. Bickel. 2004. Maximum Likelihood Estimation of Intrinsic Dimension. In Advances in Neural Information Processing Systems (NeurIPS) , Vol.\u00a017."},{"key":"e_1_3_3_1_17_2","volume-title":"European Conference on Computer Vision (ECCV)","author":"Liu Shilong","year":"2024","unstructured":"Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. 2024. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. In European Conference on Computer Vision (ECCV)."},{"key":"e_1_3_3_1_18_2","unstructured":"Ilya Loshchilov and Frank Hutter. 2016. SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/1608.03983 (2016)."},{"key":"e_1_3_3_1_19_2","volume-title":"International Conference on Learning Representations (ICLR)","author":"Loshchilov Ilya","year":"2019","unstructured":"Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_3_3_1_20_2","doi-asserted-by":"crossref","unstructured":"Jun Ma Yuting He Feifei Li Lin Han Chengwei You and Bo Wang. 2024. Segment anything in medical images. Nature Communications 15 1 (2024) 654.","DOI":"10.1038\/s41467-024-44824-z"},{"key":"e_1_3_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.9"},{"key":"e_1_3_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00864"},{"key":"e_1_3_3_1_23_2","unstructured":"Matthias Minderer Alexey Gritsenko and Neil Houlsby. 2024. Scaling open-vocabulary object detection. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2306.09683 (2024)."},{"key":"e_1_3_3_1_24_2","unstructured":"Jordi Pont-Tuset Federico Perazzi Sergi Caelles Pablo Arbel\u00e1ez Alex Sorkine-Hornung and Luc Van\u00a0Gool. 2017. The 2017 DAVIS Challenge on Video Object Segmentation. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/1704.00675 (2017)."},{"key":"e_1_3_3_1_25_2","first-page":"8748","volume-title":"International Conference on Machine Learning (ICML)","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong\u00a0Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In International Conference on Machine Learning (ICML). 8748\u20138763."},{"key":"e_1_3_3_1_26_2","unstructured":"Nikhila Ravi Valentin Gabeur Yuan-Ting Hu Ronghang Hu Chaitanya Ryali Tengyu Ma Haitham Khedr Roman R\u00e4dle Chloe Rolland Laura Gustafson Eric Mintun Junting Pan Kalyan\u00a0Vasudev Alwala Nicolas Carion Chao-Yuan Wu Ross Girshick Piotr Doll\u00e1r and Christoph Feichtenhofer. 2024. SAM 2: Segment Anything in Images and Videos. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2408.00714 (2024)."},{"key":"e_1_3_3_1_27_2","unstructured":"Tianhe Ren Yihao Chen Qing Jiang Zhaoyang Zeng Yuda Xiong Wenlong Liu Zhengyu Ma Junyi Shen Yuan Gao Xiaoke Jiang et\u00a0al. 2025. DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2411.14347 (2025)."},{"key":"e_1_3_3_1_28_2","volume-title":"Roboflow 100 Benchmark","year":"2022","unstructured":"Roboflow. 2022. RF100: A Large-Scale Multi-Domain Benchmark for Object Detection. In Roboflow 100 Benchmark. Available at https:\/\/www.rf100.org\/."},{"key":"e_1_3_3_1_29_2","unstructured":"Victor Sanh Lysandre Debut Julien Chaumond and Thomas Wolf. 2019. DistilBERT a Distilled Version of BERT: Smaller Faster Cheaper and Lighter. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/1910.01108 (2019)."},{"key":"e_1_3_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01253"},{"key":"e_1_3_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.195"},{"key":"e_1_3_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01511"},{"key":"e_1_3_3_1_33_2","unstructured":"Ao Wang Hui Chen Zijia Lin Jungong Han and Guiguang Ding. 2023. RepViT-SAM: Towards Real-Time Segmenting Anything. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2312.05760 (2023)."},{"key":"e_1_3_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW63382.2024.00367"},{"key":"e_1_3_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP.2017.8296962"},{"key":"e_1_3_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00363"},{"key":"e_1_3_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01525"},{"key":"e_1_3_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00289"},{"key":"e_1_3_3_1_39_2","unstructured":"Ning Xu Linjie Yang Yuchen Fan Dingcheng Yue Yuchen Liang Jianchao Yang and Thomas\u00a0S. Huang. 2018. YouTube-VOS: A Large-Scale Video Object Segmentation Benchmark. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/1809.03327 (2018)."},{"key":"e_1_3_3_1_40_2","unstructured":"Cheng-Yen Yang Hsiang-Wei Huang Wenhao Chai Zhongyu Jiang and Jenq-Neng Hwang. 2024. SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2411.11922 (2024)."},{"key":"e_1_3_3_1_41_2","unstructured":"En Yu Tiancai Wang Zhuoling Li Yuang Zhang Xiangyu Zhang and Wenbing Tao. 2023. MOTRv3: Release-Fetch Supervision for End-to-End Multi-Object Tracking. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2305.14298 (2023)."},{"key":"e_1_3_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46475-6_5"},{"key":"e_1_3_3_1_43_2","doi-asserted-by":"publisher","unstructured":"Jan Zah\u00e1lka Stevan Rudinac Bj\u00f6rn\u00a0\u00de\u00f3r J\u00f3nsson Dennis\u00a0C. Koelma and Marcel Worring. 2018. Blackthorn: Large-Scale Interactive Multimodal Learning. IEEE Transactions on Multimedia 20 3 (2018) 687\u2013698. 10.1109\/TMM.2017.2755986","DOI":"10.1109\/TMM.2017.2755986"},{"key":"e_1_3_3_1_44_2","unstructured":"Chengxi Zeng Yuxuan Jiang and Aaron Zhang. 2025. EfficientSAM3: Progressive Hierarchical Distillation for Video Concept Segmentation from SAM1 2 and 3. arxiv:https:\/\/arXiv.org\/abs\/2511.15833\u00a0[cs.CV] https:\/\/arxiv.org\/abs\/2511.15833"},{"key":"e_1_3_3_1_45_2","unstructured":"Chengxi Zeng Yuxuan Jiang Fan Zhang Alberto Gambaruto and Tilo Burghardt. 2025. Agglomerating Large Vision Encoders via Distillation for VFSS Segmentation. arxiv:https:\/\/arXiv.org\/abs\/2504.02351\u00a0[cs.CV] https:\/\/arxiv.org\/abs\/2504.02351"},{"key":"e_1_3_3_1_46_2","unstructured":"Chengxi Zeng David Smithard Alberto\u00a0M Gambaruto and Tilo Burghardt. 2025. Tuning Vision Foundation Model via Test-Time Prompt-Guided Training for VFSS Segmentations. arxiv:https:\/\/arXiv.org\/abs\/2501.18474\u00a0[cs.CV] https:\/\/arxiv.org\/abs\/2501.18474"},{"key":"e_1_3_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-19812-0_38"},{"key":"e_1_3_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP55913.2025.11084428"},{"key":"e_1_3_3_1_49_2","unstructured":"Chaoning Zhang Dongshen Han Yu Qiao Jung\u00a0Uk Kim Sung-Ho Bae Seungkyu Lee and Choong\u00a0Seon Hong. 2023. Faster Segment Anything: Towards Lightweight SAM for Mobile Applications. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2306.14289 (2023)."},{"key":"e_1_3_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-20047-2_1"},{"key":"e_1_3_3_1_51_2","volume-title":"arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2306.12156","author":"Zhao Xu","year":"2023","unstructured":"Xu Zhao, Wenchao Ding, Yongqi An, Yinglong Du, Tao Yu, Min Li, Ming Tang, and Jinqiao Wang. 2023. Fast Segment Anything. In arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2306.12156."},{"key":"e_1_3_3_1_52_2","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Zhou Chong","year":"2024","unstructured":"Chong Zhou, Xiangtai Li, Chen\u00a0Change Loy, and Bo Dai. 2024. EdgeSAM: Prompt-in-the-Loop Distillation for On-Device Deployment of SAM. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)."}],"event":{"name":"ICMR '26: International Conference on Multimedia Retrieval","location":"Amsterdam The Netherlands","acronym":"ICMR '26","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 2026 International Conference on Multimedia Retrieval"],"original-title":[],"deposited":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T15:06:23Z","timestamp":1781535983000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3805622.3810586"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,6,15]]},"references-count":51,"alternative-id":["10.1145\/3805622.3810586","10.1145\/3805622"],"URL":"https:\/\/doi.org\/10.1145\/3805622.3810586","relation":{},"subject":[],"published":{"date-parts":[[2026,6,15]]},"assertion":[{"value":"2026-06-15","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}