{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,23]],"date-time":"2025-08-23T05:24:54Z","timestamp":1755926694695,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":34,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,6,14]],"date-time":"2022-06-14T00:00:00Z","timestamp":1655164800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,6,14]]},"DOI":"10.1145\/3524273.3528181","type":"proceedings-article","created":{"date-parts":[[2022,8,5]],"date-time":"2022-08-05T22:23:21Z","timestamp":1659738201000},"page":"62-72","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["CrispSearch"],"prefix":"10.1145","author":[{"given":"Zhiming","family":"Hu","sequence":"first","affiliation":[{"name":"Samsung AI Centre, Toronto, Canada"}]},{"given":"Lan","family":"Xiao","sequence":"additional","affiliation":[{"name":"Pinterest, Toronto, Canada"}]},{"given":"Mete","family":"Kemertas","sequence":"additional","affiliation":[{"name":"Samsung AI Centre, Toronto, Canada"}]},{"given":"Caleb","family":"Phillips","sequence":"additional","affiliation":[{"name":"Samsung AI Centre, Toronto, Canada"}]},{"given":"Iqbal","family":"Mohomed","sequence":"additional","affiliation":[{"name":"Samsung AI Centre, Toronto, Canada"}]},{"given":"Afsaneh","family":"Fazly","sequence":"additional","affiliation":[{"name":"Samsung AI Centre, Toronto, Canada"}]}],"member":"320","published-online":{"date-parts":[[2022,8,5]]},"reference":[{"volume-title":"Smartphone Unit Shipments by Price Category Worldwide from 2012 to","year":"2022","key":"e_1_3_2_1_1_1","unstructured":"2022. 
Smartphone Unit Shipments by Price Category Worldwide from 2012 to 2022 . https:\/\/www.statista.com\/statistics\/934471\/smartphone-shipments-by-price-category-worldwide\/. (2022). Online; accessed 28 March 2022. 2022. Smartphone Unit Shipments by Price Category Worldwide from 2012 to 2022. https:\/\/www.statista.com\/statistics\/934471\/smartphone-shipments-by-price-category-worldwide\/. (2022). Online; accessed 28 March 2022."},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"crossref","unstructured":"Peter Anderson Xiaodong He Chris Buehler Damien Teney Mark Johnson Stephen Gould and Lei Zhang. 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In CVPR.  Peter Anderson Xiaodong He Chris Buehler Damien Teney Mark Johnson Stephen Gould and Lei Zhang. 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In CVPR.","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_3_2_1_3_1","volume-title":"FrugalML: How to Use ML Prediction APIs More Accurately and Cheaply. arXiv preprint:2006.07512","author":"Chen Lingjiao","year":"2020","unstructured":"Lingjiao Chen , Matei Zaharia , and James Zou . 2020. FrugalML: How to Use ML Prediction APIs More Accurately and Cheaply. arXiv preprint:2006.07512 ( 2020 ). Lingjiao Chen, Matei Zaharia, and James Zou. 2020. FrugalML: How to Use ML Prediction APIs More Accurately and Cheaply. arXiv preprint:2006.07512 (2020)."},{"key":"e_1_3_2_1_4_1","volume-title":"Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu.","author":"Chen Yen-Chun","year":"2020","unstructured":"Yen-Chun Chen , Linjie Li , Licheng Yu , Ahmed El Kholy , Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020 . Uniter : Universal Image-Text Representation Learning. In ECCV. 104--120. Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. Uniter: Universal Image-Text Representation Learning. In ECCV. 
104--120."},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N19-1423"},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2016.2599174"},{"key":"e_1_3_2_1_7_1","volume-title":"Jamie Ryan Kiros, and Sanja Fidler","author":"Faghri Fartash","year":"2018","unstructured":"Fartash Faghri , David J Fleet , Jamie Ryan Kiros, and Sanja Fidler . 2018 . VSE++: Improving Visual-Semantic Embeddings with Hard Negatives . (2018). https:\/\/github.com\/fartashf\/vsepp Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2018. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. (2018). https:\/\/github.com\/fartashf\/vsepp"},{"key":"e_1_3_2_1_8_1","volume-title":"Retrieve Fast","author":"Geigle Gregor","year":"2021","unstructured":"Gregor Geigle , Jonas Pfeiffer , Nils Reimers , Ivan Vuli\u0107 , and Iryna Gurevych . 2021. Retrieve Fast , Rerank Smart : Cooperative and Joint Approaches for Improved Cross-Modal Retrieval . arXiv preprint arXiv:2103.11920 (2021). Gregor Geigle, Jonas Pfeiffer, Nils Reimers, Ivan Vuli\u0107, and Iryna Gurevych. 2021. Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval. arXiv preprint arXiv:2103.11920 (2021)."},{"key":"e_1_3_2_1_9_1","volume-title":"Focus: Querying Large Video Datasets with Low Latency and Low Cost. In USENIX OSDI. 269--286.","author":"Hsieh Kevin","year":"2018","unstructured":"Kevin Hsieh , Ganesh Ananthanarayanan , Peter Bodik , Shivaram Venkataraman , Paramvir Bahl , Matthai Philipose , Phillip B Gibbons , and Onur Mutlu . 2018 . Focus: Querying Large Video Datasets with Low Latency and Low Cost. In USENIX OSDI. 269--286. Kevin Hsieh, Ganesh Ananthanarayanan, Peter Bodik, Shivaram Venkataraman, Paramvir Bahl, Matthai Philipose, Phillip B Gibbons, and Onur Mutlu. 2018. Focus: Querying Large Video Datasets with Low Latency and Low Cost. In USENIX OSDI. 
269--286."},{"key":"e_1_3_2_1_10_1","volume-title":"Pixel-Bert: Aligning Image Pixels with Text by Deep MultiModal Transformers. arXiv preprint arXiv:2004.00849","author":"Huang Zhicheng","year":"2020","unstructured":"Zhicheng Huang , Zhaoyang Zeng , Bei Liu , Dongmei Fu , and Jianlong Fu. 2020. Pixel-Bert: Aligning Image Pixels with Text by Deep MultiModal Transformers. arXiv preprint arXiv:2004.00849 ( 2020 ). Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. 2020. Pixel-Bert: Aligning Image Pixels with Text by Deep MultiModal Transformers. arXiv preprint arXiv:2004.00849 (2020)."},{"key":"e_1_3_2_1_11_1","volume-title":"Scaling Up Visual and Vision-Language Representation Learning with Noisy Text Supervision. arXiv preprint arXiv:2102.05918","author":"Jia Chao","year":"2021","unstructured":"Chao Jia , Yinfei Yang , Ye Xia , Yi-Ting Chen , Zarana Parekh , Hieu Pham , Quoc V. Le , Yunhsuan Sung , Zhen Li , and Tom Duerig . 2021. Scaling Up Visual and Vision-Language Representation Learning with Noisy Text Supervision. arXiv preprint arXiv:2102.05918 ( 2021 ). Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling Up Visual and Vision-Language Representation Learning with Noisy Text Supervision. arXiv preprint arXiv:2102.05918 (2021)."},{"key":"e_1_3_2_1_12_1","volume-title":"Noscope: Optimizing Neural Network Queries over Video at Scale. arXiv preprint arXiv:1703.02529","author":"Kang Daniel","year":"2017","unstructured":"Daniel Kang , John Emmons , Firas Abuzaid , Peter Bailis , and Matei Zaharia . 2017 . Noscope: Optimizing Neural Network Queries over Video at Scale. arXiv preprint arXiv:1703.02529 (2017). Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, and Matei Zaharia. 2017. Noscope: Optimizing Neural Network Queries over Video at Scale. 
arXiv preprint arXiv:1703.02529 (2017)."},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"crossref","unstructured":"Andrej Karpathy and Li Fei-Fei. 2015. Deep Visual-Semantic Alignments for Generating Image Descriptions. In CVPR.  Andrej Karpathy and Li Fei-Fei. 2015. Deep Visual-Semantic Alignments for Generating Image Descriptions. In CVPR.","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"e_1_3_2_1_14_1","volume-title":"Zemel","author":"Kiros Ryan","year":"2014","unstructured":"Ryan Kiros , Ruslan Salakhutdinov , and Richard S . Zemel . 2014 . Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models . arXiv preprint arXiv:1411.2539 (2014). Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. 2014. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. arXiv preprint arXiv:1411.2539 (2014)."},{"key":"e_1_3_2_1_15_1","volume-title":"Willump: A Statistically-Aware End-to-End Optimizer for Machine Learning Inference. arXiv preprint arXiv:1906.01974","author":"Kraft Peter","year":"2019","unstructured":"Peter Kraft , Daniel Kang , Deepak Narayanan , Shoumik Palkar , Peter Bailis , and Matei Zaharia . 2019 . Willump: A Statistically-Aware End-to-End Optimizer for Machine Learning Inference. arXiv preprint arXiv:1906.01974 (2019). Peter Kraft, Daniel Kang, Deepak Narayanan, Shoumik Palkar, Peter Bailis, and Matei Zaharia. 2019. Willump: A Statistically-Aware End-to-End Optimizer for Machine Learning Inference. arXiv preprint arXiv:1906.01974 (2019)."},{"key":"e_1_3_2_1_16_1","unstructured":"Kuang-Huei Lee Xi Chen Gang Hua Houdong Hu and Xiaodong He. 2018. Stacked Cross Attention for Image-Text Matching.  Kuang-Huei Lee Xi Chen Gang Hua Houdong Hu and Xiaodong He. 2018. Stacked Cross Attention for Image-Text Matching."},{"key":"e_1_3_2_1_17_1","volume-title":"Hoi","author":"Li Junnan","year":"2021","unstructured":"Junnan Li , Ramprasaath R. Selvaraju , Akhilesh D. Gotmare , Shafiq Joty , Caiming Xiong , and Steven C.H . Hoi . 2021 . 
Align Before Fuse: Vision and Language Representation Learning with Momentum Distillation. In arXiv preprint arXiv: 2107.07651. Junnan Li, Ramprasaath R. Selvaraju, Akhilesh D. Gotmare, Shafiq Joty, Caiming Xiong, and Steven C.H. Hoi. 2021. Align Before Fuse: Vision and Language Representation Learning with Momentum Distillation. In arXiv preprint arXiv: 2107.07651."},{"key":"e_1_3_2_1_18_1","unstructured":"Kunpeng Li Yulun Zhang Kai Li Yuanyuan Li and Yun Fu. 2019. Visual Semantic Reasoning for Image-Text Matching. In ICCV. 4653--4661.  Kunpeng Li Yulun Zhang Kai Li Yuanyuan Li and Yun Fu. 2019. Visual Semantic Reasoning for Image-Text Matching. In ICCV. 4653--4661."},{"key":"e_1_3_2_1_19_1","volume-title":"Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks","author":"Li Xiujun","year":"2020","unstructured":"Xiujun Li , Xi Yin , Chunyuan Li , Pengchuan Zhang , Xiaowei Hu , Lei Zhang , Lijuan Wang , Houdong Hu , Li Dong , Furu Wei , 2020 . Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks . In ECCV. Springer , 121--137. Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. 2020. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. In ECCV. Springer, 121--137."},{"key":"e_1_3_2_1_20_1","unstructured":"Tsung-Yi Lin Michael Maire Serge Belongie James Hays Pietro Perona Deva Ramanan Piotr Doll\u00e1r and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context.  Tsung-Yi Lin Michael Maire Serge Belongie James Hays Pietro Perona Deva Ramanan Piotr Doll\u00e1r and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context."},{"key":"e_1_3_2_1_21_1","first-page":"13","article-title":"ViLBERT","volume":"32","author":"Lu Jiasen","year":"2019","unstructured":"Jiasen Lu , Dhruv Batra , Devi Parikh , and Stefan Lee . 2019 . 
ViLBERT : Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In Advances in Neural Information Processing Systems 32. 13 -- 23 . Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In Advances in Neural Information Processing Systems 32. 13--23.","journal-title":"Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_22_1","volume-title":"Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders. arXiv preprint arXiv:2008.05231","author":"Messina Nicola","year":"2020","unstructured":"Nicola Messina , Giuseppe Amato , Andrea Esuli , Fabrizio Falchi , Claudio Gennaro , and St\u00e9phane Marchand-Maillet . 2020. Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders. arXiv preprint arXiv:2008.05231 ( 2020 ). Nicola Messina, Giuseppe Amato, Andrea Esuli, Fabrizio Falchi, Claudio Gennaro, and St\u00e9phane Marchand-Maillet. 2020. Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders. arXiv preprint arXiv:2008.05231 (2020)."},{"key":"e_1_3_2_1_23_1","unstructured":"Tomas Mikolov Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26. 3111--3119.  Tomas Mikolov Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26. 3111--3119."},{"key":"e_1_3_2_1_24_1","volume-title":"Manning","author":"Pennington Jeffrey","year":"2014","unstructured":"Jeffrey Pennington , Richard Socher , and Christopher D . Manning . 2014 . 
GloVe: Global Vectors for Word Representation. In Proc. of EMNLP. 1532--1543. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Proc. of EMNLP. 1532--1543."},{"key":"e_1_3_2_1_25_1","volume-title":"Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever.","author":"Radford Alec","year":"2021","unstructured":"Alec Radford , Jong Wook Kim , Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021 . Learning Transferable Visual Models from Natural Language Supervision . arXiv preprint arXiv:2103.00020 (2021). Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models from Natural Language Supervision. arXiv preprint arXiv:2103.00020 (2021)."},{"key":"e_1_3_2_1_26_1","volume-title":"Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv preprint arXiv:1506.01497","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren , Kaiming He , Ross Girshick , and Jian Sun . 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv preprint arXiv:1506.01497 ( 2015 ). Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv preprint arXiv:1506.01497 (2015)."},{"key":"e_1_3_2_1_27_1","volume-title":"Advances in Neural Information Processing Systems","volume":"28","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren , Kaiming He , Ross Girshick , and Jian Sun . 2015 . Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks . 
In Advances in Neural Information Processing Systems , Vol. 28 . Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems, Vol. 28."},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"crossref","unstructured":"Siqi Sun Yen-Chun Chen Linjie Li Shuohang Wang Yuwei Fang and Jingjing Liu. 2021. LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval. In NAACL.  Siqi Sun Yen-Chun Chen Linjie Li Shuohang Wang Yuwei Fang and Jingjing Liu. 2021. LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval. In NAACL.","DOI":"10.18653\/v1\/2021.naacl-main.77"},{"key":"e_1_3_2_1_29_1","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N Gomez \u0141ukasz Kaiser and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30. 5998--6008.  Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N Gomez \u0141ukasz Kaiser and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30. 5998--6008."},{"key":"e_1_3_2_1_30_1","unstructured":"Paul Viola and Michael Jones. 2001. Rapid Object Detection using a Boosted Cascade of Simple Features. In CVPR.  Paul Viola and Michael Jones. 2001. Rapid Object Detection using a Boosted Cascade of Simple Features. In CVPR."},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"crossref","unstructured":"Xi Wei Tianzhu Zhang Yan Li Yongdong Zhang and Feng Wu. 2020. Multi-Modality Cross Attention Network for Image and Sentence Matching. In CVPR. 10941--10950.  Xi Wei Tianzhu Zhang Yan Li Yongdong Zhang and Feng Wu. 2020. Multi-Modality Cross Attention Network for Image and Sentence Matching. In CVPR. 
10941--10950.","DOI":"10.1109\/CVPR42600.2020.01095"},{"key":"e_1_3_2_1_32_1","volume-title":"Saul","author":"Weinberger Kilian Q.","year":"2008","unstructured":"Kilian Q. Weinberger and Lawrence K . Saul . 2008 . Fast Solvers and Efficient Implementations for Distance Metric Learning . Kilian Q. Weinberger and Lawrence K. Saul. 2008. Fast Solvers and Efficient Implementations for Distance Metric Learning."},{"key":"e_1_3_2_1_33_1","unstructured":"Hao Wu Jiayuan Mao Yufeng Zhang Yuning Jiang Lei Li Weiwei Sun and Wei-Ying Ma. 2019. UniVSE: Robust Visual Semantic Embeddings via Structured Semantic Representations. 6609--6618.  Hao Wu Jiayuan Mao Yufeng Zhang Yuning Jiang Lei Li Weiwei Sun and Wei-Ying Ma. 2019. UniVSE: Robust Visual Semantic Embeddings via Structured Semantic Representations. 6609--6618."},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00166"}],"event":{"name":"MMSys '22: 13th ACM Multimedia Systems Conference","sponsor":["SIGMM ACM Special Interest Group on Multimedia","SIGCOMM ACM Special Interest Group on Data Communication","SIGMOBILE ACM Special Interest Group on Mobility of Systems, Users, Data and Computing"],"location":"Athlone Ireland","acronym":"MMSys '22"},"container-title":["Proceedings of the 13th ACM Multimedia Systems Conference"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3524273.3528181","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3524273.3528181","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:31:05Z","timestamp":1750188665000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3524273.3528181"}},"subtitle":["low-latency on-device language-based image 
retrieval"],"short-title":[],"issued":{"date-parts":[[2022,6,14]]},"references-count":34,"alternative-id":["10.1145\/3524273.3528181","10.1145\/3524273"],"URL":"https:\/\/doi.org\/10.1145\/3524273.3528181","relation":{},"subject":[],"published":{"date-parts":[[2022,6,14]]},"assertion":[{"value":"2022-08-05","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}