{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,10]],"date-time":"2025-12-10T09:05:08Z","timestamp":1765357508742,"version":"3.44.0"},"reference-count":56,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2023,12,8]],"date-time":"2023-12-08T00:00:00Z","timestamp":1701993600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"United States Air Force Research Laboratory and the Department of the Air Force Artificial Intelligence Accelerator","award":["FA8750-19-2-1000"],"award-info":[{"award-number":["FA8750-19-2-1000"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2023,12,8]]},"abstract":"<jats:p>As image datasets become ubiquitous, the problem of ad-hoc searches over image data is increasingly important. Many high-level data tasks in machine learning, such as constructing datasets for training and testing object detectors, imply finding ad-hoc objects or scenes within large image datasets as a key sub-problem. New foundational visual-semantic embeddings trained on massive web datasets such as Contrastive Language-Image Pre-Training (CLIP) can help users start searches on their own data, but we find there is a long tail of queries where these models fall short in practice. Seesaw is a system for interactive ad-hoc searches on image datasets that integrates state-of-the-art embeddings like CLIP with user feedback in the form of box annotations to help users quickly locate images of interest in their data even in the long tail of harder queries. One key challenge for Seesaw is that, in practice, many sensible approaches to incorporating feedback into future results, including state-of-the-art active-learning algorithms, can worsen results compared to introducing no feedback, partly due to CLIP's high-average performance. Therefore, Seesaw includes several algorithms that empirically result in larger and also more consistent improvements. We compare Seesaw's accuracy to both using CLIP alone and to a state-of-the-art active-learning baseline and find Seesaw consistently helps improve results for users across four datasets and more than a thousand queries. Seesaw increases Average Precision (AP) on search tasks by an average of .08 on a wide benchmark (from a base of .72), and by a .27 on a subset of more difficult queries where CLIP alone performs poorly.<\/jats:p>","DOI":"10.1145\/3626754","type":"journal-article","created":{"date-parts":[[2023,12,12]],"date-time":"2023-12-12T14:01:21Z","timestamp":1702389681000},"page":"1-26","source":"Crossref","is-referenced-by-count":4,"title":["SeeSaw: Interactive Ad-hoc Search Over Image Databases"],"prefix":"10.1145","volume":"1","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5888-4318","authenticated-orcid":false,"given":"Oscar","family":"Moll","sequence":"first","affiliation":[{"name":"MIT CSAIL, Cambridge, MA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-7485-6661","authenticated-orcid":false,"given":"Manuel","family":"Favela","sequence":"additional","affiliation":[{"name":"MIT, Cambridge, MA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7470-3265","authenticated-orcid":false,"given":"Samuel","family":"Madden","sequence":"additional","affiliation":[{"name":"MIT CSAIL, Cambridge, MA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4598-2808","authenticated-orcid":false,"given":"Vijay","family":"Gadepally","sequence":"additional","affiliation":[{"name":"MIT Lincoln Laboratory, Cambridge, MA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6122-0590","authenticated-orcid":false,"given":"Michael","family":"Cafarella","sequence":"additional","affiliation":[{"name":"MIT CSAIL, Cambridge, MA, USA"}]}],"member":"320","published-online":{"date-parts":[[2023,12,12]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/3209978.3210176"},{"key":"e_1_2_1_2_1","volume-title":"PYRAMID METHODS IN IMAGE PROCESSING. undefined","author":"Adelson E","year":"1984","unstructured":"E Adelson, P Burt, C Anderson, J M Ogden, and J Bergen. 1984. PYRAMID METHODS IN IMAGE PROCESSING. undefined (1984). https:\/\/www.semanticscholar.org\/paper\/e49793511ba203e26b99e7e81fd15a7d505b5cea"},{"key":"e_1_2_1_3_1","volume-title":"Advances in Neural Information Processing Systems, H Wallach, H Larochelle, A Beygelzimer, F d\\'Alch\u00e9-Buc","author":"Barbu Andrei","year":"2019","unstructured":"Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. 2019. ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. In Advances in Neural Information Processing Systems, H Wallach, H Larochelle, A Beygelzimer, F d\\'Alch\u00e9-Buc, E Fox, and R Garnett (Eds.), Vol. 32. Curran Associates, Inc. https:\/\/proceedings.neurips.cc\/paper\/2019\/file\/97af07a14cacba681feacf3012730892-Paper.pdf"},{"key":"e_1_2_1_4_1","unstructured":"Mikhail Belkin and Partha Niyogi. 2006. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. https:\/\/www.jmlr.org\/papers\/volume7\/belkin06a\/belkin06a.pdf. https:\/\/www.jmlr.org\/papers\/volume7\/belkin06a\/belkin06a.pdf Accessed: 2023--3--7."},{"key":"e_1_2_1_5_1","unstructured":"E. Bernhardsson. [n.d.]. ANNOY: Approximate Nearest Neighbors Oh Yeah. https:\/\/github.com\/spotify\/annoy. Accessed: 2021-05--20."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCOM.1983.1095851"},{"key":"e_1_2_1_7_1","volume-title":"Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu.","author":"Chen Yen-Chun","year":"2019","unstructured":"Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019. UNITER: UNiversal Image-TExt Representation Learning. (Sept. 2019). arXiv:1909.11740 [cs.CV] http:\/\/arxiv.org\/abs\/1909.11740"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1561\/1900000006"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v36i6.20591"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/83.817596"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2014.2300479"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/1963405.1963487"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1007\/978--3--642--21064--8_5"},{"key":"e_1_2_1_14_1","volume-title":"Jamie Ryan Kiros, and Sanja Fidler","author":"Faghri Fartash","year":"2017","unstructured":"Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2017. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. (July 2017). arXiv:1707.05612 [cs.LG] http:\/\/arxiv.org\/abs\/1707.05612"},{"key":"e_1_2_1_15_1","volume-title":"Marc Aurelio Ranzato, and Tomas Mikolov","author":"Frome Andrea","year":"2013","unstructured":"Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc Aurelio Ranzato, and Tomas Mikolov. 2013. DeViSE: A Deep Visual-Semantic Embedding Model. In Advances in Neural Information Processing Systems, C J Burges, L Bottou, M Welling, Z Ghahramani, and K Q Weinberger (Eds.), Vol. 26. Curran Associates, Inc. https:\/\/proceedings.neurips.cc\/paper\/2013\/file\/7cce53cf90577442771720a370c3c723-Paper.pdf"},{"key":"e_1_2_1_16_1","volume-title":"Bayesian Optimal Active Search and Surveying. (June","author":"Garnett Roman","year":"2012","unstructured":"Roman Garnett, Yamuna Krishnamurthy, Xuehan Xiong, Jeff Schneider, and Richard Mann. 2012. Bayesian Optimal Active Search and Surveying. (June 2012). arXiv:1206.6406 [cs.LG] http:\/\/arxiv.org\/abs\/1206.6406"},{"key":"e_1_2_1_17_1","volume-title":"Bayesian Optimal Active Search and Surveying. (June","author":"Garnett Roman","year":"2012","unstructured":"Roman Garnett, Yamuna Krishnamurthy, Xuehan Xiong, Jeff Schneider, and Richard Mann. 2012. Bayesian Optimal Active Search and Surveying. (June 2012). arXiv:1206.6406 [cs.LG]"},{"key":"e_1_2_1_18_1","volume-title":"TREC 2016 Total Recall Track Overview. TREC","author":"Grossman Maura R","year":"2016","unstructured":"Maura R Grossman, G Cormack, and Adam Roegiest. 2016. TREC 2016 Total Recall Track Overview. TREC (2016)."},{"key":"e_1_2_1_19_1","volume-title":"LVIS: A Dataset for Large Vocabulary Instance Segmentation. arXiv [cs.CV] (Aug","author":"Gupta Agrim","year":"2019","unstructured":"Agrim Gupta, Piotr Doll\u00e1r, and Ross Girshick. 2019. LVIS: A Dataset for Large Vocabulary Instance Segmentation. arXiv [cs.CV] (Aug 2019). https:\/\/arxiv.org\/abs\/1908.03195"},{"key":"e_1_2_1_20_1","volume-title":"https:\/\/www.wildlifeinsights.org\/ Accessed on","author":"Insights Wildlife","year":"2023","unstructured":"Wildlife Insights. [n.d.]. Wildlife Insights. https:\/\/www.wildlifeinsights.org\/ Accessed on Mar 26, 2023."},{"key":"e_1_2_1_21_1","unstructured":"Y Ishikawa R Subramanya and C Faloutsos. 1998. MindReader: Querying Databases Through Multiple Examples. VLDB J. (1998). https:\/\/www.semanticscholar.org\/paper\/04938be9fd727ea6363cc950efd263ff82d02b77"},{"key":"e_1_2_1_22_1","volume-title":"Scaling up visual and vision-language representation learning with noisy text supervision. (Feb","author":"Jia Chao","year":"2021","unstructured":"Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. (Feb. 2021). arXiv:2102.05918 [cs.CV] http:\/\/proceedings.mlr.press\/v139\/jia21b\/jia21b.pdf"},{"key":"e_1_2_1_23_1","volume-title":"Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine Learning Research","volume":"1723","author":"Jiang Shali","year":"2017","unstructured":"Shali Jiang, Gustavo Malkomes, Geoff Converse, Alyssa Shofner, Benjamin Moseley, and Roman Garnett. 2017. Efficient Nonmyopic Active Search. In Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 70), Doina Precup and Yee Whye Teh (Eds.). PMLR, 1714--1723. https:\/\/proceedings.mlr.press\/v70\/jiang17d.html"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/tcsvt.2004.826775"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1586"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2019.2945942"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994514"},{"key":"e_1_2_1_28_1","volume-title":"Microsoft COCO: Common Objects in Context. CoRR abs\/1405.0312","author":"Lin Tsung-Yi","year":"2014","unstructured":"Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll\u00e1r, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. CoRR abs\/1405.0312 (2014). arXiv:1405.0312 http:\/\/arxiv.org\/abs\/1405.0312"},{"key":"e_1_2_1_29_1","volume-title":"Feature Pyramid Networks for Object Detection. (Dec","author":"Lin Tsung-Yi","year":"2016","unstructured":"Tsung-Yi Lin, Piotr Doll\u00e1r, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. 2016. Feature Pyramid Networks for Object Detection. (Dec. 2016). arXiv:1612.03144 [cs.CV] http:\/\/arxiv.org\/abs\/1612.03144"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF01589116"},{"key":"e_1_2_1_31_1","unstructured":"C.D. Manning P. Raghavan and H. Sch\u00fctze. 2008. Introduction to Information Retrieval. Cambridge University Press. https:\/\/books.google.com\/books?id=t1PoSh4uwVcC"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511809071"},{"volume-title":"Probabilistic Machine Learning: An Introduction","author":"Murphy Kevin P","key":"e_1_2_1_33_1","unstructured":"Kevin P Murphy. 2022. Probabilistic Machine Learning: An Introduction. MIT Press. https:\/\/play.google.com\/store\/books\/details?id=wrZNEAAAQBAJ"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/1102351.1102430"},{"key":"e_1_2_1_35_1","volume-title":"PyTorch: An Imperative Style","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas K\u00f6pf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. (Dec. 2019). arXiv:1912.01703 [cs.LG] http:\/\/arxiv.org\/abs\/1912.01703"},{"key":"e_1_2_1_36_1","volume-title":"Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. 10, 3 (June","author":"Platt John C","year":"2000","unstructured":"John C Platt. 2000. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. 10, 3 (June 2000). http:\/\/dx.doi.org\/"},{"key":"e_1_2_1_37_1","unstructured":"Alec Radford 1. Jong Wook Kim 1. Chris Hallacy Aditya Ramesh Gabriel Goh Sandhini Agarwal Girish Sastry Amanda Askell Pamela Mishkin Jack Clark and et al. [n.d.]. Learning transferable visual models from natural language supervision. https:\/\/cdn.openai.com\/papers\/Learning_Transferable_Visual_Models_From_Natural_Language_Supervision.pdf"},{"key":"e_1_2_1_38_1","volume-title":"Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever.","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. (Feb. 2021). arXiv:2103.00020 [cs.CV] http:\/\/arxiv.org\/abs\/2103.00020"},{"key":"e_1_2_1_39_1","series-title":"Lecture Notes in Computer Science","volume-title":"Active search for high recall: A non-stationary extension of Thompson sampling","author":"Renders Jean-Michel","unstructured":"Jean-Michel Renders. 2018. Active search for high recall: A non-stationary extension of Thompson sampling. In Lecture Notes in Computer Science. Springer International Publishing, Cham, 722--728."},{"volume-title":"The SMART Retrieval System -- Experiments in Automatic Document Processing","author":"Rocchio J. J.","key":"e_1_2_1_40_1","unstructured":"J. J. Rocchio. 1971. Relevance feedback in information retrieval. In The SMART Retrieval System -- Experiments in Automatic Document Processing, Gerard Salton (Ed.). Prentice Hall, Englewood Cliffs, NJ, 313--323."},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1002\/(SICI)1097-4571(199006)41:4<288::AID-ASI8>3.0.CO;2-H"},{"key":"e_1_2_1_42_1","unstructured":"Burr Settles. [n.d.]. Active Learning. Morgan Claypool."},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9781107298019"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP.1995.537667"},{"key":"e_1_2_1_45_1","volume-title":"Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019","author":"Tan Fuwen","year":"2019","unstructured":"Fuwen Tan, Paola Cascante-Bonilla, Xiaoxiao Guo, Hui Wu, Song Feng, and Vicente Ordonez. 2019. Drill-down: Interactive Retrieval of Complex Scenes using Natural Language Queries. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8--14, 2019, Vancouver, BC, Canada, Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alch\u00e9-Buc, Emily B. Fox, and Roman Garnett (Eds.). 2647--2657."},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298658"},{"key":"e_1_2_1_47_1","unstructured":"Nuno Vasconcelos and Andrew Lippman. [n.d.]. Learning from user feedback in image retrieval systems. https:\/\/papers.nips.cc\/paper\/1999\/file\/7283518d47a05a09d33779a17adf1707-Paper.pdf. https:\/\/papers.nips.cc\/paper\/1999\/file\/7283518d47a05a09d33779a17adf1707-Paper.pdf Accessed: 2021--8--12."},{"key":"e_1_2_1_48_1","unstructured":"Leejay Wu Christos Faloutsos Katia Sycara and Terry R Payne. [n.d.]. FALCON: Feedback adaptive loop for content-based retrieval. http:\/\/www.cs.cmu.edu\/~christos\/PUBLICATIONS\/vldb2k-falcon.pdf. http:\/\/www.cs.cmu.edu\/~christos\/PUBLICATIONS\/vldb2k-falcon.pdf Accessed: 2022--5--30."},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.14778\/2535569.2448954"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1002\/ece3.4464"},{"key":"e_1_2_1_51_1","volume-title":"BDD100K: A Diverse Driving Video Database with Scalable Annotation Tooling. CoRR abs\/1805.04687","author":"Yu Fisher","year":"2018","unstructured":"Fisher Yu, Wenqi Xian, Yingying Chen, Fangchen Liu, Mike Liao, Vashisht Madhavan, and Trevor Darrell. 2018. BDD100K: A Diverse Driving Video Database with Scalable Annotation Tooling. CoRR abs\/1805.04687 (2018). arXiv:1805.04687 http:\/\/arxiv.org\/abs\/1805.04687"},{"key":"e_1_2_1_52_1","volume-title":"Language Processing, and Software Engineering. (Aug","author":"Yu Zhe","year":"2018","unstructured":"Zhe Yu and Tim Menzies. 2018. Total Recall, Language Processing, and Software Engineering. (Aug. 2018). arXiv:1809.00039 [cs.SE]"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00530-002-0070--3"},{"key":"e_1_2_1_54_1","unstructured":"Xiaojin Zhu and Zoubin Ghahramani. 2002. Learning from labeled and unlabeled data with label propagation. (2002). https:\/\/www.semanticscholar.org\/paper\/2a4ca461fa847e8433bab67e7bfe4620371c1f77"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.2200\/S00196ED1V01Y200906AIM006"},{"key":"e_1_2_1_56_1","volume-title":"ICML 2003 workshop on the continuum from labeled to unlabeled data in machine learning and data mining","volume":"3","author":"Zhu Xiaojin","year":"2003","unstructured":"Xiaojin Zhu, John Lafferty, and Zoubin Ghahramani. 2003. Combining active learning and semi-supervised learning using gaussian fields and harmonic functions. In ICML 2003 workshop on the continuum from labeled to unlabeled data in machine learning and data mining, Vol. 3. http:\/\/mlg.eng.cam.ac.uk\/zoubin\/papers\/zglactive.pdf"}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3626754","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3626754","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T13:02:28Z","timestamp":1755867748000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3626754"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,12,8]]},"references-count":56,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2023,12,8]]}},"alternative-id":["10.1145\/3626754"],"URL":"https:\/\/doi.org\/10.1145\/3626754","relation":{},"ISSN":["2836-6573"],"issn-type":[{"type":"electronic","value":"2836-6573"}],"subject":[],"published":{"date-parts":[[2023,12,8]]}}}