{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T09:01:51Z","timestamp":1775638911278,"version":"3.50.1"},"reference-count":76,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2024,9,30]],"date-time":"2024-09-30T00:00:00Z","timestamp":1727654400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100006374","name":"National Science Foundation","doi-asserted-by":"publisher","award":["IIS-2335881,IIS-2238431"],"award-info":[{"award-number":["IIS-2335881,IIS-2238431"]}],"id":[{"id":"10.13039\/501100006374","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2024,10,1]]},"abstract":"<jats:p>Localizing video moments based on the movement patterns of objects is an important task in video analytics. Existing video analytics systems offer two types of querying interfaces based on natural language and SQL, respectively. However, both types of interfaces have major limitations. SQL-based systems require high query specification time, whereas natural language-based systems require large training datasets to achieve satisfactory retrieval accuracy.<\/jats:p>\n                  <jats:p>\n                    To address these limitations, we present SketchQL, a video database management system (VDBMS) for offline, exploratory video moment retrieval that is both easy to use and generalizes well across multiple video moment datasets. To improve ease-of-use, SketchQL features a\n                    <jats:italic toggle=\"yes\">visual query interface<\/jats:italic>\n                    that enables users to sketch complex visual queries through intuitive drag-and-drop actions. To improve generalizability, SketchQL operates on object-tracking primitives that are reliably extracted across various datasets using pre-trained models. We present a learned similarity search algorithm for retrieving video moments closely matching the user's visual query based on object trajectories. SketchQL trains the model on a diverse dataset generated with a novel simulator, that enhances its accuracy across a wide array of datasets and queries. We evaluate SketchQL on four real-world datasets with nine queries, demonstrating its superior usability and retrieval accuracy over state-of-the-art VDBMSs.\n                  <\/jats:p>","DOI":"10.1145\/3677140","type":"journal-article","created":{"date-parts":[[2024,9,30]],"date-time":"2024-09-30T17:41:44Z","timestamp":1727718104000},"page":"1-27","source":"Crossref","is-referenced-by-count":6,"title":["SketchQL: Video Moment Querying with a Visual Query Interface"],"prefix":"10.1145","volume":"2","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9144-8999","authenticated-orcid":false,"given":"Renzhi","family":"Wu","sequence":"first","affiliation":[{"name":"Georgia Institute of Technology, Atlanta, GA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5163-6234","authenticated-orcid":false,"given":"Pramod","family":"Chunduri","sequence":"additional","affiliation":[{"name":"Georgia Institute of Technology, Atlanta, GA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4054-2958","authenticated-orcid":false,"given":"Ali","family":"Payani","sequence":"additional","affiliation":[{"name":"Cisco Systems Inc., Cupertino, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-3202-3767","authenticated-orcid":false,"given":"Xu","family":"Chu","sequence":"additional","affiliation":[{"name":"Georgia Institute of Technology, Atlanta, GA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7706-6978","authenticated-orcid":false,"given":"Joy","family":"Arulraj","sequence":"additional","affiliation":[{"name":"Georgia Institute of Technology, Atlanta, GA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3282-5360","authenticated-orcid":false,"given":"Kexin","family":"Rong","sequence":"additional","affiliation":[{"name":"Georgia Institute of Technology, Atlanta, GA, USA"}]}],"member":"320","published-online":{"date-parts":[[2024,9,30]]},"reference":[{"key":"e_1_2_2_1_1","volume-title":"CLIP (Contrastive Language-image pretraining), predict the most relevant text snippet given an image. https:\/\/github.com\/openai\/CLIP [Online","year":"2023","unstructured":"2021. Openai\/CLIP: CLIP (Contrastive Language-image pretraining), predict the most relevant text snippet given an image. https:\/\/github.com\/openai\/CLIP [Online; accessed 30. Aug. 2023]."},{"key":"e_1_2_2_2_1","volume-title":"ECCV'22","year":"2023","unstructured":"2022. Benchmarking Panoptic Scene Graph Generation (PSG), ECCV'22. https:\/\/github.com\/Jingkang50\/OpenPSG [Online; accessed 30. Aug. 2023]."},{"key":"e_1_2_2_3_1","volume-title":"https:\/\/pytorch.org\/docs\/stable\/generated\/torch.nn.TransformerEncoder.html [Online","author":"TransformerEncoder","year":"2023","unstructured":"2023. TransformerEncoder ' PyTorch 2.0 documentation. https:\/\/pytorch.org\/docs\/stable\/generated\/torch.nn.TransformerEncoder.html [Online; accessed 8. Apr. 2023]."},{"key":"e_1_2_2_4_1","unstructured":"2024. code and sample clips. https:\/\/figshare.com\/s\/da2add67051616fbf5de"},{"key":"e_1_2_2_5_1","volume-title":"Analysis of the trajectories of left-turning vehicles at signalized intersections. Transportation research procedia 48","author":"Abdeljaber Osama","year":"2020","unstructured":"Osama Abdeljaber, Adel Younis, and Wael Alhajyaseen. 2020. Analysis of the trajectories of left-turning vehicles at signalized intersections. Transportation research procedia 48 (2020), 1288--1295."},{"key":"e_1_2_2_6_1","unstructured":"Sami Abu-El-Haija Nisarg Kothari Joonseok Lee Apostol (Paul) Natsev George Toderici Balakrishnan Varadarajan and Sudheendra Vijayanarasimhan. 2016. YouTube-8M: A Large-Scale Video Classification Benchmark. In arXiv:1609.08675. https:\/\/arxiv.org\/pdf\/1609.08675v1.pdf"},{"key":"e_1_2_2_7_1","volume-title":"Practical and optimal LSH for angular distance. Advances in neural information processing systems 28","author":"Andoni Alexandr","year":"2015","unstructured":"Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, and Ludwig Schmidt. 2015. Practical and optimal LSH for angular distance. Advances in neural information processing systems 28 (2015)."},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.618"},{"key":"e_1_2_2_9_1","first-page":"201","article-title":"FeedbackBypass: A new approach to interactive similarity query processing","volume":"1","author":"Bartolini Ilaria","year":"2001","unstructured":"Ilaria Bartolini, Paolo Ciaccia, Florian Waas, et al. 2001. FeedbackBypass: A new approach to interactive similarity query processing. In VLDB, Vol. 1. 201--210.","journal-title":"VLDB"},{"key":"e_1_2_2_10_1","volume-title":"SHIATSU: tagging and retrieving videos without worries. Multimedia tools and applications 63","author":"Bartolini Ilaria","year":"2013","unstructured":"Ilaria Bartolini, Marco Patella, and Corrado Romani. 2013. SHIATSU: tagging and retrieving videos without worries. Multimedia tools and applications 63 (2013), 357--385."},{"key":"e_1_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389692"},{"key":"e_1_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/253262.253263"},{"key":"e_1_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298698"},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/266180.266382"},{"key":"e_1_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2021.3137605"},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6627"},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.14778\/3551793.3551865"},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3452803"},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3526181"},{"key":"e_1_2_2_20_1","volume-title":"Pinhole camera model - Wikipedia. https:\/\/en.wikipedia.org\/w\/index.php?title=Pinhole_camera_model&oldid=1110164612 [Online","author":"Contributors","year":"2023","unstructured":"Contributors to Wikimedia projects. 2022. Pinhole camera model - Wikipedia. https:\/\/en.wikipedia.org\/w\/index.php?title=Pinhole_camera_model&oldid=1110164612 [Online; accessed 9. Apr. 2023]."},{"key":"e_1_2_2_21_1","volume-title":"Evaluation measures (information retrieval) - Wikipedia. https:\/\/en.wikipedia.org\/w\/index.php?title=Evaluation_measures_(information_retrieval)&oldid=1138907815 [Online","author":"Contributors","year":"2023","unstructured":"Contributors to Wikimedia projects. 2023. Evaluation measures (information retrieval) - Wikipedia. https:\/\/en.wikipedia.org\/w\/index.php?title=Evaluation_measures_(information_retrieval)&oldid=1138907815 [Online; accessed 19. Mar. 2023]."},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/957013.957124"},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11042-005-2715-7"},{"key":"e_1_2_2_24_1","volume-title":"Conference on robot learning. PMLR, 1--16","author":"Dosovitskiy Alexey","year":"2017","unstructured":"Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. 2017. CARLA: An open urban driving simulator. In Conference on robot learning. PMLR, 1--16."},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11042-020-09414-3"},{"key":"e_1_2_2_26_1","volume-title":"Rekall: Specifying video events using compositions of spatiotemporal labels. arXiv preprint arXiv:1910.02993","author":"Fu Daniel Y","year":"2019","unstructured":"Daniel Y Fu, Will Crichton, James Hong, Xinwei Yao, Haotian Zhang, Anh Truong, Avanika Narayan, Maneesh Agrawala, Christopher R\u00e9, and Kayvon Fatahalian. 2019. Rekall: Specifying video events using compositions of spatiotemporal labels. arXiv preprint arXiv:1910.02993 (2019)."},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.563"},{"key":"e_1_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00155"},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW.2018.00223"},{"key":"e_1_2_2_30_1","volume-title":"Spatio-temporal analysis of team sports--a survey. arXiv preprint arXiv:1602.06994","author":"Gudmundsson Joachim","year":"2016","unstructured":"Joachim Gudmundsson and Michael Horton. 2016. Spatio-temporal analysis of team sports--a survey. arXiv preprint arXiv:1602.06994 (2016)."},{"key":"e_1_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298698"},{"key":"e_1_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/2578726.2578775"},{"key":"e_1_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/MMDBMS.1995.520425"},{"key":"e_1_2_2_34_1","volume-title":"Proceedings of the fourth ACM international conference on Multimedia. 75--86","author":"Hibino Stacie","year":"1997","unstructured":"Stacie Hibino and Elke A Rundensteiner. 1997. MMVIS: Design and implementation of a multimedia visual information seeking environment. In Proceedings of the fourth ACM international conference on Multimedia. 75--86."},{"key":"e_1_2_2_35_1","volume-title":"Cross attention network for few-shot classification. Advances in neural information processing systems 32","author":"Hou Ruibing","year":"2019","unstructured":"Ruibing Hou, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen. 2019. Cross attention network for few-shot classification. Advances in neural information processing systems 32 (2019)."},{"key":"e_1_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2006.891352"},{"key":"e_1_2_2_37_1","unstructured":"Yonggang Jin and Farzin Mokhtarian. 2004. Efficient Video Retrieval by Motion Trajectory.. In BMVC. Citeseer 1--10."},{"key":"e_1_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298990"},{"key":"e_1_2_2_39_1","volume-title":"BlazeIt: optimizing declarative aggregation and limit queries for neural network-based video analytics. arXiv preprint arXiv:1805.01046","author":"Kang Daniel","year":"2018","unstructured":"Daniel Kang, Peter Bailis, and Matei Zaharia. 2018. BlazeIt: optimizing declarative aggregation and limit queries for neural network-based video analytics. arXiv preprint arXiv:1805.01046 (2018)."},{"key":"e_1_2_2_40_1","volume-title":"SVIQUEL: A spatial visual query and exploration language. In DEXA. 290--299.","author":"Kaushik Sudhir","year":"1998","unstructured":"Sudhir Kaushik and Elke A Rundensteiner. 1998. SVIQUEL: A spatial visual query and exploration language. In DEXA. 290--299."},{"key":"e_1_2_2_41_1","volume-title":"Property of average precision and its generalization: An examination of evaluation indicator for information retrieval experiments","author":"Kishida Kazuaki","unstructured":"Kazuaki Kishida. 2005. Property of average precision and its generalization: An examination of evaluation indicator for information retrieval experiments. National Institute of Informatics Tokyo, Japan."},{"key":"e_1_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2020.3048606"},{"key":"e_1_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3556537"},{"key":"e_1_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3209978.3210003"},{"key":"e_1_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1014614221234"},{"key":"e_1_2_2_46_1","doi-asserted-by":"crossref","unstructured":"Yao Lu Aakanksha Chowdhery Srikanth Kandula and Surajit Chaudhuri. 2018. Accelerating machine learning inference with probabilistic predicates. In SIGMOD. 1493--1508.","DOI":"10.1145\/3183713.3183751"},{"key":"e_1_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2011.5995586"},{"key":"e_1_2_2_48_1","volume-title":"International conference on machine learning. PMLR, 8748--8763","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748--8763."},{"key":"e_1_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00207"},{"key":"e_1_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/1390334.1390453"},{"key":"e_1_2_2_51_1","volume-title":"Zelda: Video Analytics using Vision-Language Models. arXiv:2305.03785 [cs.DB]","author":"Romero Francisco","year":"2023","unstructured":"Francisco Romero, Caleb Winston, Johann Hauswald, Matei Zaharia, and Christos Kozyrakis. 2023. Zelda: Video Analytics using Vision-Language Models. arXiv:2305.03785 [cs.DB]"},{"key":"e_1_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/ITSC45102.2020.9294422"},{"key":"e_1_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/IROS.2016.7759351"},{"key":"e_1_2_2_54_1","first-page":"506","article-title":"Efficient user-adaptable similarity search in large multimedia databases","volume":"97","author":"Seidl Thomas","year":"1997","unstructured":"Thomas Seidl and Hans-Peter Kriegel. 1997. Efficient user-adaptable similarity search in large multimedia databases. In VLDB, Vol. 97. 506--515.","journal-title":"VLDB"},{"key":"e_1_2_2_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00852"},{"key":"e_1_2_2_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/34.895972"},{"key":"e_1_2_2_57_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-32381-3_16"},{"key":"e_1_2_2_58_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00643"},{"key":"e_1_2_2_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/1646396.1646442"},{"key":"e_1_2_2_60_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3320230"},{"key":"e_1_2_2_61_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i4.16406"},{"key":"e_1_2_2_62_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3526142"},{"key":"e_1_2_2_63_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11633-018-1126-y"},{"key":"e_1_2_2_64_1","volume-title":"Zujin Guo, Kaiyang Zhou, Wayne Zhang, and Ziwei Liu.","author":"Yang Jingkang","year":"2022","unstructured":"Jingkang Yang, Yi Zhe Ang, Zujin Guo, Kaiyang Zhou, Wayne Zhang, and Ziwei Liu. 2022. Panoptic Scene Graph Generation. In ECCV."},{"key":"e_1_2_2_65_1","volume-title":"Proceedings 14th International Conference on Data Engineering. IEEE, 201--208","author":"Yi Byoung-Kee","year":"1998","unstructured":"Byoung-Kee Yi, Hosagrahar V Jagadish, and Christos Faloutsos. 1998. Efficient retrieval of similar time sequences under time warping. In Proceedings 14th International Conference on Data Engineering. IEEE, 201--208."},{"key":"e_1_2_2_66_1","doi-asserted-by":"publisher","DOI":"10.1006\/jvlc.1996.0022"},{"key":"e_1_2_2_67_1","doi-asserted-by":"publisher","DOI":"10.1109\/69.755617"},{"key":"e_1_2_2_68_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00271"},{"key":"e_1_2_2_69_1","volume-title":"A survey of autonomous driving: Common practices and emerging technologies","author":"Yurtsever Ekim","year":"2020","unstructured":"Ekim Yurtsever, Jacob Lambert, Alexander Carballo, and Kazuya Takeda. 2020. A survey of autonomous driving: Common practices and emerging technologies. IEEE access 8 (2020), 58443--58469."},{"key":"e_1_2_2_70_1","doi-asserted-by":"publisher","DOI":"10.14778\/3611479.3611482"},{"key":"e_1_2_2_71_1","volume-title":"Span-based localizing network for natural language video localization. arXiv preprint arXiv:2004.13931","author":"Zhang Hao","year":"2020","unstructured":"Hao Zhang, Aixin Sun, Wei Jing, and Joey Tianyi Zhou. 2020. Span-based localizing network for natural language video localization. arXiv preprint arXiv:2004.13931 (2020)."},{"key":"e_1_2_2_72_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2021.3120745"},{"key":"e_1_2_2_73_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6984"},{"key":"e_1_2_2_74_1","volume-title":"Bytetrack: Multi-object tracking by associating every detection box. In Computer Vision--ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23--27","author":"Zhang Yifu","year":"2022","unstructured":"Yifu Zhang, Peize Sun, Yi Jiang, Dongdong Yu, Fucheng Weng, Zehuan Yuan, Ping Luo, Wenyu Liu, and Xinggang Wang. 2022. Bytetrack: Multi-object tracking by associating every detection box. In Computer Vision--ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23--27, 2022, Proceedings, Part XXII. Springer, 1--21."},{"key":"e_1_2_2_75_1","volume-title":"Improving sharpness-aware minimization with fisher mask for better generalization on language models. arXiv preprint arXiv:2210.05497","author":"Zhong Qihuang","year":"2022","unstructured":"Qihuang Zhong, Liang Ding, Li Shen, Peng Mi, Juhua Liu, Bo Du, and Dacheng Tao. 2022. Improving sharpness-aware minimization with fisher mask for better generalization on language models. arXiv preprint arXiv:2210.05497 (2022)."},{"key":"e_1_2_2_76_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01631"}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3677140","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3677140","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,31]],"date-time":"2026-03-31T17:11:28Z","timestamp":1774977088000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3677140"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,9,30]]},"references-count":76,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,10,1]]}},"alternative-id":["10.1145\/3677140"],"URL":"https:\/\/doi.org\/10.1145\/3677140","relation":{},"ISSN":["2836-6573"],"issn-type":[{"value":"2836-6573","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,9,30]]}}}