{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T04:59:44Z","timestamp":1750309184878,"version":"3.41.0"},"reference-count":42,"publisher":"Association for Computing Machinery (ACM)","issue":"9","license":[{"start":{"date-parts":[[2024,8,16]],"date-time":"2024-08-16T00:00:00Z","timestamp":1723766400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2024,9,30]]},"abstract":"<jats:p>Advances in deep learning have enabled accurate language-based search and retrieval (e.g., over user photos) in the cloud. Many users prefer to store their photos in the home due to privacy concerns. As such, a need arises for models that can perform cross-modal search on resource-limited devices. State-of-the-art (SOTA) cross-modal retrieval models achieve high accuracy through learning entangled representations that enable fine-grained similarity calculation between a language query and an image, but at the expense of having a prohibitively high retrieval latency. Alternatively, there is a new class of methods that exhibits good performance with low latency but requires a lot more computational resources and an order of magnitude more training data (i.e., large web-scraped datasets consisting of millions of image\u2013caption pairs), making them infeasible to use in a commercial context. From a pragmatic perspective, none of the existing methods are suitable for developing commercial applications for low-latency cross-modal retrieval on low-resource devices. We propose CrispSearch, a cascaded approach that greatly reduces the retrieval latency with minimal loss in ranking accuracy for on-device language-based image retrieval. 
The idea behind our approach is to combine a light-weight and runtime-efficient coarse model with a fine re-ranking stage. Given a language query, the coarse model effectively filters out many of the irrelevant image candidates. After this filtering, only a handful of strong candidates will be selected and sent to a fine model for re-ranking. Extensive experimental results with two SOTA models for the fine re-ranking stage on standard benchmark datasets show that CrispSearch results in a speedup of up to 38 times over the SOTA fine methods with negligible performance degradation. Moreover, our method does not require millions of training instances, making it a pragmatic solution to on-device search and retrieval.<\/jats:p><jats:p\/>","DOI":"10.1145\/3649896","type":"journal-article","created":{"date-parts":[[2024,3,15]],"date-time":"2024-03-15T12:02:31Z","timestamp":1710504151000},"page":"1-18","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Realizing Efficient On-Device Language-based Image Retrieval"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5465-2819","authenticated-orcid":false,"given":"Zhiming","family":"Hu","sequence":"first","affiliation":[{"name":"Samsung AI Centre, Toronto, Canada"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-1314-1602","authenticated-orcid":false,"given":"Mete","family":"Kemertas","sequence":"additional","affiliation":[{"name":"University of Toronto, Toronto, Canada"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-0265-0727","authenticated-orcid":false,"given":"Lan","family":"Xiao","sequence":"additional","affiliation":[{"name":"Meta, Toronto, Canada"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-9612-8821","authenticated-orcid":false,"given":"Caleb","family":"Phillips","sequence":"additional","affiliation":[{"name":"Recursion, Toronto, 
Canada"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-0598-8966","authenticated-orcid":false,"given":"Iqbal","family":"Mohomed","sequence":"additional","affiliation":[{"name":"Samsung AI Centre, Toronto, Canada"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4479-4901","authenticated-orcid":false,"given":"Afsaneh","family":"Fazly","sequence":"additional","affiliation":[{"name":"Samsung AI Centre, Toronto, Canada"}]}],"member":"320","published-online":{"date-parts":[[2024,8,16]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"Statista. 2022. Smartphone Unit Shipments by Price Category Worldwide from 2012 to 2022. Retrieved March 28 2022 from https:\/\/www.statista.com\/statistics\/934471\/smartphone-shipments-by-price-category-worldwide\/"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_3_2_4_2","unstructured":"Haoli Bai Lu Hou Lifeng Shang Xin Jiang Irwin King and Michael R. Lyu. 2022. Towards efficient post-training quantization of pre-trained language models. In Advances in Neural Information Processing Systems 35.1405\u20131418."},{"key":"e_1_3_2_5_2","unstructured":"Max Bain Arsha Nagrani G\u00fcl Varol and Andrew Zisserman. 2022. A clip-hitchhiker\u2019s guide to long video retrieval. arXiv preprint arXiv:2205.08508 (2022)."},{"key":"e_1_3_2_6_2","unstructured":"Lingjiao Chen Matei Zaharia and James Zou. 2020. FrugalML: How to use ML prediction APIs more accurately and cheaply. arXiv preprint:2006.07512 (2020)."},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58577-8_7"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N19-1423"},{"key":"e_1_3_2_9_2","doi-asserted-by":"crossref","unstructured":"J. Donahue L. A. Hendricks M. Rohrbach S. Venugopalan S. Guadarrama K. Saenko and T. Darrell. 2017. Long-term recurrent convolutional networks for visual recognition and description. 
IEEE Transactions on Pattern Analysis and Machine Intelligence 39 4 (2017) 677\u2013691.","DOI":"10.1109\/TPAMI.2016.2599174"},{"key":"e_1_3_2_10_2","unstructured":"Fartash Faghri David J. Fleet Jamie Ryan Kiros and Sanja Fidler. 2018. VSE++: Improving visual-semantic embeddings with hard negatives. GitHub. Retrieved March 21 2024 from https:\/\/github.com\/fartashf\/vsepp"},{"key":"e_1_3_2_11_2","doi-asserted-by":"crossref","unstructured":"Gregor Geigle Jonas Pfeiffer Nils Reimers Ivan Vuli\u0107 and Iryna Gurevych. 2021. Retrieve fast rerank smart: Cooperative and joint approaches for improved cross-modal retrieval. arXiv preprint arXiv:2103.11920 (2021).","DOI":"10.1162\/tacl_a_00473"},{"key":"e_1_3_2_12_2","volume-title":"Proceedings of the NIPS Workshop","author":"Hinton Geoffrey","year":"2015","unstructured":"Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. In Proceedings of the NIPS Workshop. http:\/\/arxiv.org\/abs\/1503.02531"},{"key":"e_1_3_2_13_2","first-page":"269","volume-title":"Proceedings of USENIX OSDI","author":"Hsieh Kevin","year":"2018","unstructured":"Kevin Hsieh, Ganesh Ananthanarayanan, Peter Bodik, Shivaram Venkataraman, Paramvir Bahl, Matthai Philipose, Phillip B. Gibbons, and Onur Mutlu. 2018. Focus: Querying large video datasets with low latency and low cost. In Proceedings of USENIX OSDI. 269\u2013286."},{"key":"e_1_3_2_14_2","doi-asserted-by":"crossref","unstructured":"Zhiming Hu Lan Xiao Mete Kemertas Caleb Phillips Iqbal Mohomed and Afsaneh Fazly. 2022. CrispSearch: Low-latency on-device language-based image retrieval. In Proceedings of MMSys. 62\u201372.","DOI":"10.1145\/3524273.3528181"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3612161"},{"key":"e_1_3_2_16_2","first-page":"12976","volume-title":"Proceedings of","author":"Huang Zhicheng","year":"2021","unstructured":"Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, and Jianlong Fu. 
2021. Seeing out of the box: End-to-end pre-training for vision-language representation learning. In Proceedings of CVPR. 12976\u201312985."},{"key":"e_1_3_2_17_2","unstructured":"Zhicheng Huang Zhaoyang Zeng Bei Liu Dongmei Fu and Jianlong Fu. 2020. Pixel-BERT: Aligning image pixels with text by deep multi-modal Transformers. arXiv preprint arXiv:2004.00849 (2020)."},{"key":"e_1_3_2_18_2","unstructured":"Chao Jia Yinfei Yang Ye Xia Yi-Ting Chen Zarana Parekh Hieu Pham Quoc V. Le Yunhsuan Sung Zhen Li and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918 (2021)."},{"key":"e_1_3_2_19_2","doi-asserted-by":"crossref","unstructured":"Daniel Kang John Emmons Firas Abuzaid Peter Bailis and Matei Zaharia. 2017. NoScope: Optimizing neural network queries over video at scale. arXiv preprint arXiv:1703.02529 (2017).","DOI":"10.14778\/3137628.3137664"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"e_1_3_2_21_2","unstructured":"Ryan Kiros Ruslan Salakhutdinov and Richard S. Zemel. 2014. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539 (2014)."},{"key":"e_1_3_2_22_2","unstructured":"Peter Kraft Daniel Kang Deepak Narayanan Shoumik Palkar Peter Bailis and Matei Zaharia. 2019. Willump: A statistically-aware end-to-end optimizer for machine learning inference. arXiv preprint arXiv:1906.01974 (2019)."},{"key":"e_1_3_2_23_2","doi-asserted-by":"crossref","unstructured":"Kuang-Huei Lee Xi Chen Gang Hua Houdong Hu and Xiaodong He. 2018. Stacked cross attention for image-text matching. In Computer Vision\u2014ECCV 2018. Lecture Notes in Computer Science Vol. 11208. 
Springer 212\u2013228.","DOI":"10.1007\/978-3-030-01225-0_13"},{"key":"e_1_3_2_24_2","first-page":"12888","volume-title":"Proceedings of ICML","author":"Li Junnan","year":"2022","unstructured":"Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of ICML. 12888\u201312900."},{"key":"e_1_3_2_25_2","volume-title":"arXiv preprint arXiv: 2107.07651","author":"Li Junnan","year":"2021","unstructured":"Junnan Li, Ramprasaath R. Selvaraju, Akhilesh D. Gotmare, Shafiq Joty, Caiming Xiong, and Steven C. H. Hoi. 2021. Align before fuse: Vision and language representation learning with momentum distillation. arXiv preprint arXiv: 2107.07651 (2021)."},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00475"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58577-8_8"},{"key":"e_1_3_2_28_2","doi-asserted-by":"crossref","unstructured":"Tsung-Yi Lin Michael Maire Serge Belongie James Hays Pietro Perona Deva Ramanan Piotr Doll\u00e1r and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Computer Vision\u2014ECCV 2014. Lecture Notes in Computer Science Vol. 8693. Springer 740\u2013755.","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_3_2_29_2","first-page":"13","volume-title":"Advances in Neural Information Processing Systems 32","author":"Lu Jiasen","year":"2019","unstructured":"Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems 32. 13\u201323."},{"key":"e_1_3_2_30_2","doi-asserted-by":"crossref","unstructured":"Nicola Messina Giuseppe Amato Andrea Esuli Fabrizio Falchi Claudio Gennaro and St\u00e9phane Marchand-Maillet. 2020. Fine-grained visual textual alignment for cross-modal retrieval using Transformer encoders. 
arXiv preprint arXiv:2008.05231 (2020).","DOI":"10.1145\/3451390"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00970"},{"key":"e_1_3_2_32_2","first-page":"3111","volume-title":"Advances in Neural Information Processing Systems 26","author":"Mikolov Tomas","year":"2013","unstructured":"Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26. 3111\u20133119."},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1162"},{"key":"e_1_3_2_34_2","unstructured":"Alec Radford Jong Wook Kim Chris Hallacy Aditya Ramesh Gabriel Goh Sandhini Agarwal Girish Sastry Amanda Askell Pamela Mishkin Jack Clark Gretchen Krueger and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020 (2021)."},{"key":"e_1_3_2_35_2","unstructured":"Shaoqing Ren Kaiming He Ross Girshick and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497 (2015)."},{"key":"e_1_3_2_36_2","volume-title":"Advances in Neural Information Processing Systems 28","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28. 1\u20139."},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.naacl-main.77"},{"key":"e_1_3_2_38_2","first-page":"5998","volume-title":"Advances in Neural Information Processing Systems 30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. 
In Advances in Neural Information Processing Systems 30. 5998\u20136008."},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2001.990517"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01095"},{"key":"e_1_3_2_41_2","doi-asserted-by":"crossref","unstructured":"Kilian Q. Weinberger and Lawrence K. Saul. 2008. Fast solvers and efficient implementations for distance metric learning.","DOI":"10.1145\/1390156.1390302"},{"key":"e_1_3_2_42_2","unstructured":"Hao Wu Jiayuan Mao Yufeng Zhang Yuning Jiang Lei Li Weiwei Sun and Wei-Ying Ma. 2019. UniVSE: Robust visual semantic embeddings via structured semantic representations. arXiv:1904.05521 (2019)."},{"key":"e_1_3_2_43_2","doi-asserted-by":"crossref","unstructured":"Peter Young Alice Lai Micah Hodosh and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2 (2014) 67\u201378.","DOI":"10.1162\/tacl_a_00166"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and 
Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3649896","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3649896","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T22:54:07Z","timestamp":1750287247000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3649896"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,8,16]]},"references-count":42,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2024,9,30]]}},"alternative-id":["10.1145\/3649896"],"URL":"https:\/\/doi.org\/10.1145\/3649896","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2024,8,16]]},"assertion":[{"value":"2023-04-21","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-02-09","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-08-16","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}