{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,31]],"date-time":"2026-01-31T00:45:34Z","timestamp":1769820334265,"version":"3.49.0"},"reference-count":79,"publisher":"Springer Science and Business Media LLC","issue":"10","license":[{"start":{"date-parts":[[2022,8,5]],"date-time":"2022-08-05T00:00:00Z","timestamp":1659657600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,8,5]],"date-time":"2022-08-05T00:00:00Z","timestamp":1659657600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100000266","name":"engineering and physical sciences research council","doi-asserted-by":"publisher","award":["EP\/R025290\/1"],"award-info":[{"award-number":["EP\/R025290\/1"]}],"id":[{"id":"10.13039\/501100000266","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100010669","name":"h2020 leit information and communication technologies","doi-asserted-by":"publisher","award":["957252"],"award-info":[{"award-number":["957252"]}],"id":[{"id":"10.13039\/100010669","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100010669","name":"H2020 LEIT Information and Communication Technologies","doi-asserted-by":"publisher","award":["951911"],"award-info":[{"award-number":["951911"]}],"id":[{"id":"10.13039\/100010669","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Comput Vis"],"published-print":{"date-parts":[[2022,10]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>In this paper, we address the problem of high performance and computationally efficient content-based video retrieval in large-scale datasets. 
Current methods typically propose either: (i) fine-grained approaches employing spatio-temporal representations and similarity calculations, achieving high performance at a high computational cost or (ii) coarse-grained approaches representing\/indexing videos as global vectors, where the spatio-temporal structure is lost, providing low performance but also having low computational cost. In this work, we propose a Knowledge Distillation framework, called Distill-and-Select (DnS), that starting from a well-performing fine-grained Teacher Network learns: (a) Student Networks at different retrieval performance and computational efficiency trade-offs and (b) a Selector Network that at test time rapidly directs samples to the appropriate student to maintain both high retrieval performance and high computational efficiency. We train several students with different architectures and arrive at different trade-offs of performance and efficiency, i.e., speed and storage requirements, including fine-grained students that store\/index videos using binary representations. Importantly, the proposed scheme allows Knowledge Distillation in large, unlabelled datasets\u2014this leads to good students. We evaluate DnS on five public datasets on three different video retrieval tasks and demonstrate (a) that our students achieve state-of-the-art performance in several cases and (b) that the DnS framework provides an excellent trade-off between retrieval performance, computational speed, and storage space. In specific configurations, the proposed method achieves similar mAP with the teacher but is 20 times faster and requires 240 times less storage space. 
The collected dataset and implementation are publicly available: <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/mever-team\/distill-and-select\">https:\/\/github.com\/mever-team\/distill-and-select<\/jats:ext-link>.<\/jats:p>","DOI":"10.1007\/s11263-022-01651-3","type":"journal-article","created":{"date-parts":[[2022,8,5]],"date-time":"2022-08-05T10:07:05Z","timestamp":1659694025000},"page":"2385-2407","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":30,"title":["DnS: Distill-and-Select for Efficient and Accurate Video Indexing and Retrieval"],"prefix":"10.1007","volume":"130","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2297-4802","authenticated-orcid":false,"given":"Giorgos","family":"Kordopatis-Zilos","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2036-9089","authenticated-orcid":false,"given":"Christos","family":"Tzelepis","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5441-7341","authenticated-orcid":false,"given":"Symeon","family":"Papadopoulos","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6447-9020","authenticated-orcid":false,"given":"Ioannis","family":"Kompatsiaris","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3913-4738","authenticated-orcid":false,"given":"Ioannis","family":"Patras","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2022,8,5]]},"reference":[{"key":"1651_CR1","doi-asserted-by":"crossref","unstructured":"Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T.,\u00a0 & Sivic, J. (2016). NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition","DOI":"10.1109\/CVPR.2016.572"},{"key":"1651_CR2","unstructured":"Ba, J. L., Kiros, J. 
R.,\u00a0 & Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450"},{"key":"1651_CR3","doi-asserted-by":"crossref","unstructured":"Baraldi, L., Douze, M., Cucchiara, R.,\u00a0 & J\u00e9gou, H. (2018). LAMV: Learning to align and match videos with kernelized temporal layers. In Proceedings of the IEEE conference on computer vision and pattern recognition","DOI":"10.1109\/CVPR.2018.00814"},{"key":"1651_CR4","doi-asserted-by":"crossref","unstructured":"Bhardwaj, S., Srinivasan, M.,\u00a0 & Khapra, M. M. (2019). Efficient video classification using fewer frames. In Proceedings of the IEEE conference on computer vision and pattern recognition","DOI":"10.1109\/CVPR.2019.00044"},{"key":"1651_CR5","unstructured":"Bishay, M., Zoumpourlis, G.,\u00a0 & Patras, I. (2019). TARN: Temporal attentive relation network for few-shot and zero-shot action recognition. In Proceedings of the British machine vision conference"},{"key":"1651_CR6","doi-asserted-by":"crossref","unstructured":"Cai, Y., Yang, L., Ping, W., Wang, F., Mei, T., Hua, X. S.,\u00a0 & Li, S. (2011). Million-scale near-duplicate video retrieval system. In Proceedings of the ACM international conference on multimedia. ACM.","DOI":"10.1145\/2072298.2072484"},{"key":"1651_CR7","doi-asserted-by":"crossref","unstructured":"Cao, Z., Long, M., Wang, J.,\u00a0 & Yu, P.S. (2017). HashNet: Deep learning to hash by continuation. In Proceedings of the IEEE international conference on computer vision","DOI":"10.1109\/ICCV.2017.598"},{"issue":"3","key":"1651_CR8","doi-asserted-by":"publisher","first-page":"382","DOI":"10.1109\/TMM.2015.2391674","volume":"17","author":"CL Chou","year":"2015","unstructured":"Chou, C. L., Chen, H. T., & Lee, S. Y. (2015). Pattern-based near-duplicate video retrieval and localization on web-scale videos. 
IEEE Transactions on Multimedia, 17(3), 382\u2013395.","journal-title":"IEEE Transactions on Multimedia"},{"key":"1651_CR9","doi-asserted-by":"crossref","unstructured":"Chum, O., Philbin, J., Sivic, J., Isard, M.,\u00a0 & Zisserman, A. (2007). Total recall: Automatic query expansion with a generative feature model for object retrieval. In Proceedings of the IEEE international conference on computer vision","DOI":"10.1109\/ICCV.2007.4408891"},{"key":"1651_CR10","doi-asserted-by":"crossref","unstructured":"Crasto, N., Weinzaepfel, P., Alahari, K.,\u00a0 & Schmid, C. (2019). Mars: Motion-augmented rgb stream for action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition","DOI":"10.1109\/CVPR.2019.00807"},{"key":"1651_CR11","doi-asserted-by":"crossref","unstructured":"Deng, J., Pan, Y., Yao, T., Zhou, W., Li, H.,\u00a0 & Mei, T. (2019). Relation distillation networks for video object detection. In Proceedings of the IEEE international conference on computer vision","DOI":"10.1109\/ICCV.2019.00712"},{"key":"1651_CR12","doi-asserted-by":"crossref","unstructured":"Douze, M., Revaud, J., Schmid, C.,\u00a0 & J\u00e9gou, H. (2013). Stable hyper-pooling and query expansion for event detection. In Proceedings of the IEEE international conference on computer vision (pp 1825\u20131832).","DOI":"10.1109\/ICCV.2013.229"},{"issue":"4","key":"1651_CR13","doi-asserted-by":"publisher","first-page":"257","DOI":"10.1109\/TMM.2010.2046265","volume":"12","author":"M Douze","year":"2010","unstructured":"Douze, M., J\u00e9gou, H., & Schmid, C. (2010). An image-based approach to video copy detection with spatio-temporal post-filtering. IEEE Transactions on Multimedia, 12(4), 257\u2013266.","journal-title":"IEEE Transactions on Multimedia"},{"key":"1651_CR14","doi-asserted-by":"crossref","unstructured":"Feng, Y., Ma, L., Liu, W., Zhang, T., & Luo, J. (2018). Video re-localization. 
In Proceedings of the European conference on computer vision","DOI":"10.1007\/978-3-030-01264-9_4"},{"key":"1651_CR15","doi-asserted-by":"crossref","unstructured":"Gao, Z., Hua, G., Zhang, D., Jojic, N., Wang, L., Xue, J.,\u00a0 & Zheng, N. (2017). ER3: A unified framework for event retrieval, recognition and recounting. In Proceedings of the IEEE conference on computer vision and pattern recognition","DOI":"10.1109\/CVPR.2017.227"},{"key":"1651_CR16","doi-asserted-by":"crossref","unstructured":"Garcia, N. C., Morerio, P.,\u00a0 & Murino, V. (2018). Modality distillation with multiple stream networks for action recognition. In Proceedings of the European conference on computer vision","DOI":"10.1007\/978-3-030-01237-3_7"},{"issue":"12","key":"1651_CR17","doi-asserted-by":"publisher","first-page":"2916","DOI":"10.1109\/TPAMI.2012.193","volume":"35","author":"Y Gong","year":"2012","unstructured":"Gong, Y., Lazebnik, S., Gordo, A., & Perronnin, F. (2012). Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12), 2916\u20132929.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"1651_CR18","doi-asserted-by":"crossref","unstructured":"Gordo, A., Radenovic, F.,\u00a0 & Berg, T. (2020). Attention-based query expansion learning. In Proceedings of the European conference on computer vision","DOI":"10.1007\/978-3-030-58604-1_11"},{"key":"1651_CR19","doi-asserted-by":"publisher","first-page":"1789","DOI":"10.1007\/s11263-021-01453-z","volume":"129","author":"J Gou","year":"2021","unstructured":"Gou, J., Yu, B., Maybank, S. J., & Tao, D. (2021). Knowledge distillation: A survey. International Journal of Computer Vision, 129, 1789\u20131819.","journal-title":"International Journal of Computer Vision"},{"key":"1651_CR20","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S.,\u00a0 & Sun, J. (2016). 
Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition.","DOI":"10.1109\/CVPR.2016.90"},{"key":"1651_CR21","unstructured":"Hinton, G., Vinyals, O.,\u00a0 & Dean, J. (2015). Distilling the knowledge in a neural network. In Proceedings of the international conference on neural information processing systems"},{"issue":"5","key":"1651_CR22","doi-asserted-by":"publisher","first-page":"386","DOI":"10.1109\/TMM.2010.2050737","volume":"12","author":"Z Huang","year":"2010","unstructured":"Huang, Z., Shen, H. T., Shao, J., Cui, B., & Zhou, X. (2010). Practical online near-duplicate subsequence detection for continuous video streams. IEEE Transactions on Multimedia, 12(5), 386\u2013398.","journal-title":"IEEE Transactions on Multimedia"},{"key":"1651_CR23","unstructured":"Ioffe, S.,\u00a0 & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the international conference on machine learning"},{"key":"1651_CR24","doi-asserted-by":"crossref","unstructured":"J\u00e9gou, H.,\u00a0 & Chum, O. (2012). Negative evidences and co-occurences in image retrieval: The benefit of pca and whitening. In Proceedings of the European conference on computer vision (pp 774\u2013787). Springer.","DOI":"10.1007\/978-3-642-33709-3_55"},{"key":"1651_CR25","doi-asserted-by":"crossref","unstructured":"Jiang, Q. Y., He, Y., Li, G., Lin, J., Li, L.,\u00a0 & Li, W. J. (2019). SVD: A large-scale short video dataset for near-duplicate video retrieval. In Proceedings of the IEEE international conference on computer vision","DOI":"10.1109\/ICCV.2019.00538"},{"key":"1651_CR26","doi-asserted-by":"crossref","unstructured":"Jiang, Y.G., Jiang, Y.,\u00a0 & Wang, J. (2014). VCDB: A large-scale database for partial copy detection in videos. In Proceedings of the European conference on computer vision (pp. 357\u2013371). 
Springer.","DOI":"10.1007\/978-3-319-10593-2_24"},{"issue":"1","key":"1651_CR27","doi-asserted-by":"publisher","first-page":"32","DOI":"10.1109\/TBDATA.2016.2530714","volume":"2","author":"YG Jiang","year":"2016","unstructured":"Jiang, Y. G., & Wang, J. (2016). Partial copy detection in videos: A benchmark and an evaluation of popular methods. IEEE Transactions on Big Data, 2(1), 32\u201342.","journal-title":"IEEE Transactions on Big Data"},{"key":"1651_CR28","unstructured":"Kingma, D.,\u00a0 & Ba, J. (2015). Adam: A method for stochastic optimization. In Proceedings of the international conference on learning representations."},{"key":"1651_CR29","doi-asserted-by":"crossref","unstructured":"Kordopatis-Zilos, G., Papadopoulos, S., Patras, I.,\u00a0 & Kompatsiaris, I. (2017a). Near-duplicate video retrieval by aggregating intermediate cnn layers. In Proceedings of the international conference on multimedia modeling (pp. 251\u2013263). Springer.","DOI":"10.1007\/978-3-319-51811-4_21"},{"key":"1651_CR30","doi-asserted-by":"crossref","unstructured":"Kordopatis-Zilos, G., Papadopoulos, S., Patras, I.,\u00a0 & Kompatsiaris, I. (2017b). Near-duplicate video retrieval with deep metric learning. In Proceedings of the IEEE international conference on computer vision workshops (pp. 347\u2013356). IEEE.","DOI":"10.1109\/ICCVW.2017.49"},{"key":"1651_CR31","doi-asserted-by":"crossref","unstructured":"Kordopatis-Zilos, G., Papadopoulos, S., & Patras, I.,\u00a0 & Kompatsiaris, I. (2019a). FIVR: Fine-grained incident video retrieval. IEEE Transactions on Multimedia,21, 2638\u20132652.","DOI":"10.1109\/TMM.2019.2905741"},{"key":"1651_CR32","doi-asserted-by":"crossref","unstructured":"Kordopatis-Zilos, G., Papadopoulos, S., Patras, I.,\u00a0 & Kompatsiaris, I. (2019b). ViSiL: Fine-grained spatio-temporal video similarity learning. 
In Proceedings of the IEEE international conference on computer vision.","DOI":"10.1109\/ICCV.2019.00645"},{"key":"1651_CR33","unstructured":"Krizhevsky, A., Sutskever, I.,\u00a0 & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Proceedings of the international conference on neural information processing systems."},{"key":"1651_CR34","doi-asserted-by":"crossref","unstructured":"Lassance, C., Bontonou, M., Hacene, G. B., Gripon, V., Tang, J.,\u00a0 & Ortega, A. (2020). Deep geometric knowledge distillation with graphs. In Proceedings of the IEEE international conference on acoustics, speech and signal processing.","DOI":"10.1109\/ICASSP40776.2020.9053986"},{"key":"1651_CR35","doi-asserted-by":"crossref","unstructured":"Lee, J., Abu-El-Haija, S., Varadarajan, B.,\u00a0 & Natsev, A. (2018). Collaborative deep metric learning for video understanding. In Proceedings of the ACM SIGKDD international conference on knowledge discovery & data mining.","DOI":"10.1145\/3219819.3219856"},{"key":"1651_CR36","doi-asserted-by":"crossref","unstructured":"Lee, H., Lee, J., Ng, J. Y. H.,\u00a0 & Natsev, P. (2020). Large scale video representation learning via relational graph clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition.","DOI":"10.1109\/CVPR42600.2020.00684"},{"key":"1651_CR37","doi-asserted-by":"crossref","unstructured":"Li, Q., Jin, S.,\u00a0 & Yan, J. (2017). Mimicking very efficient network for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition.","DOI":"10.1109\/CVPR.2017.776"},{"key":"1651_CR38","doi-asserted-by":"crossref","unstructured":"Liang, D., Lin, L., Wang, R., Shao, J., Wang, C.,\u00a0 & Chen, Y. W. (2019). Unsupervised teacher\u2013student model for large-scale video retrieval. 
In Proceedings of the IEEE international conference on computer vision workshops","DOI":"10.1109\/ICCVW.2019.00232"},{"key":"1651_CR39","doi-asserted-by":"crossref","unstructured":"Liang, S.,\u00a0 & Wang, P. (2020). An efficient hierarchical near-duplicate video detection algorithm based on deep semantic features. In Proceedings of the international conference on multimedia modeling","DOI":"10.1007\/978-3-030-37731-1_61"},{"issue":"12","key":"1651_CR40","doi-asserted-by":"publisher","first-page":"3743","DOI":"10.1109\/TCSVT.2018.2884941","volume":"29","author":"K Liao","year":"2018","unstructured":"Liao, K., Lei, H., Zheng, Y., Lin, G., Cao, C., Zhang, M., & Ding, J. (2018). IR feature embedded bof indexing method for near-duplicate video retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 29(12), 3743\u20133753.","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"issue":"6","key":"1651_CR41","doi-asserted-by":"publisher","first-page":"1209","DOI":"10.1109\/TMM.2016.2645404","volume":"19","author":"VE Liong","year":"2017","unstructured":"Liong, V. E., Lu, J., Tan, Y. P., & Zhou, J. (2017). Deep video hashing. IEEE Transactions on Multimedia, 19(6), 1209\u20131219.","journal-title":"IEEE Transactions on Multimedia"},{"key":"1651_CR42","doi-asserted-by":"crossref","unstructured":"Liu, Y., Cao, J., Li, B., Yuan, C., Hu, W., Li, Y., & Duan, Y. (2019). Knowledge distillation via instance relationship graph. In Proceedings of the IEEE conference on computer vision and pattern recognition","DOI":"10.1109\/CVPR.2019.00726"},{"issue":"22","key":"1651_CR43","doi-asserted-by":"publisher","first-page":"24435","DOI":"10.1007\/s11042-016-4176-6","volume":"76","author":"H Liu","year":"2017","unstructured":"Liu, H., Zhao, Q., Wang, H., Lv, P., & Chen, Y. (2017). An image-based near-duplicate video retrieval and localization using improved edit distance. 
Multimedia Tools and Applications, 76(22), 24435\u201324456.","journal-title":"Multimedia Tools and Applications"},{"key":"1651_CR44","doi-asserted-by":"crossref","unstructured":"Luo, Z., Hsieh, J. T., Jiang, L., Niebles, J. C., & Fei-Fei, L. (2018). Graph distillation for action detection with privileged modalities. In Proceedings of the European conference on computer vision","DOI":"10.1007\/978-3-030-01264-9_11"},{"key":"1651_CR45","doi-asserted-by":"crossref","unstructured":"Markatopoulou, F., Galanopoulos, D., Mezaris, V., & Patras, I. (2017). Query and keyframe representations for ad-hoc video search. In Proceedings of the 2017 ACM on international conference on multimedia retrieval (pp. 407\u2013411).","DOI":"10.1145\/3078971.3079041"},{"issue":"6","key":"1651_CR46","doi-asserted-by":"publisher","first-page":"1631","DOI":"10.1109\/TCSVT.2018.2848458","volume":"29","author":"F Markatopoulou","year":"2018","unstructured":"Markatopoulou, F., Mezaris, V., & Patras, I. (2018). Implicit and explicit concept relations in deep neural networks for multi-label video\/image annotation. IEEE Transactions on Circuits and Systems for Video Technology, 29(6), 1631\u20131644.","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"key":"1651_CR47","unstructured":"Miech, A., Laptev, I., & Sivic, J. (2017). Learnable pooling with context gating for video classification. arXiv preprint arXiv:1706.06905"},{"key":"1651_CR48","doi-asserted-by":"crossref","unstructured":"Pan, B., Cai, H., Huang, D. A., Lee, K. H., Gaidon, A., Adeli, E., & Niebles, J. C. (2020). Spatio-temporal graph for video captioning with knowledge distillation. In Proceedings of the IEEE conference on computer vision and pattern recognition.","DOI":"10.1109\/CVPR42600.2020.01088"},{"key":"1651_CR49","doi-asserted-by":"crossref","unstructured":"Park, W., Kim, D., Lu, Y., & Cho, M. (2019). Relational knowledge distillation. 
In Proceedings of the IEEE conference on computer vision and pattern recognition.","DOI":"10.1109\/CVPR.2019.00409"},{"key":"1651_CR50","unstructured":"Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., & Antiga, L., et\u00a0al. (2019). PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the international conference on neural information processing systems."},{"key":"1651_CR51","doi-asserted-by":"crossref","unstructured":"Peng, B., Jin, X., Liu, J., Li, D., Wu, Y., Liu, Y., Zhou, S., & Zhang, Z. (2019). Correlation congruence for knowledge distillation. In Proceedings of the IEEE international conference on computer vision.","DOI":"10.1109\/ICCV.2019.00511"},{"key":"1651_CR52","doi-asserted-by":"crossref","unstructured":"Piergiovanni, A., Angelova, A., & Ryoo, M.S. (2020). Evolving losses for unsupervised video representation learning. In Proceedings of the IEEE conference on computer vision and pattern recognition","DOI":"10.1109\/CVPR42600.2020.00021"},{"key":"1651_CR53","doi-asserted-by":"crossref","unstructured":"Poullot, S., Tsukatani, S., Phuong\u00a0Nguyen, A., J\u00e9gou, H., & Satoh, S. (2015). Temporal matching kernel with explicit feature maps. In Proceedings of the ACM international conference on multimedia","DOI":"10.1145\/2733373.2806228"},{"key":"1651_CR54","doi-asserted-by":"crossref","unstructured":"Revaud, J., Douze, M., Schmid, C., & J\u00e9gou, H. (2013). Event retrieval in large video collections with circulant temporal encoding. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2459\u20132466). IEEE.","DOI":"10.1109\/CVPR.2013.318"},{"key":"1651_CR55","doi-asserted-by":"crossref","unstructured":"Shao, J., Wen, X., Zhao, B., & Xue, X. (2021). Temporal context aggregation for video retrieval with contrastive learning. 
In Proceedings of the IEEE winter conference on applications of computer vision","DOI":"10.1109\/WACV48630.2021.00331"},{"key":"1651_CR56","doi-asserted-by":"crossref","unstructured":"Shmelkov, K., Schmid, C., & Alahari, K. (2017). Incremental learning of object detectors without catastrophic forgetting. In Proceedings of the IEEE international conference on computer vision","DOI":"10.1109\/ICCV.2017.368"},{"key":"1651_CR57","doi-asserted-by":"crossref","unstructured":"Sivic, J., & Zisserman, A. (2003). Video Google: A text retrieval approach to object matching in videos. In Proceedings of the IEEE conference on computer vision and pattern recognition","DOI":"10.1109\/ICCV.2003.1238663"},{"key":"1651_CR58","doi-asserted-by":"crossref","unstructured":"Song, J., Yang, Y., Huang, Z., Shen, H. T., & Hong, R. (2011). Multiple feature hashing for real-time large scale near-duplicate video retrieval. In Proceedings of the 19th ACM international conference on multimedia","DOI":"10.1145\/2072298.2072354"},{"issue":"8","key":"1651_CR59","doi-asserted-by":"publisher","first-page":"1997","DOI":"10.1109\/TMM.2013.2271746","volume":"15","author":"J Song","year":"2013","unstructured":"Song, J., Yang, Y., Huang, Z., Shen, H. T., & Luo, J. (2013). Effective multiple feature hashing for large-scale near-duplicate video retrieval. IEEE Transactions on Multimedia, 15(8), 1997\u20132008.","journal-title":"IEEE Transactions on Multimedia"},{"issue":"7","key":"1651_CR60","doi-asserted-by":"publisher","first-page":"3210","DOI":"10.1109\/TIP.2018.2814344","volume":"27","author":"J Song","year":"2018","unstructured":"Song, J., Zhang, H., Li, X., Gao, L., Wang, M., & Hong, R. (2018). Self-supervised video hashing with hierarchical binary auto-encoder. IEEE Transactions on Image Processing, 27(7), 3210\u20133221.","journal-title":"IEEE Transactions on Image Processing"},{"key":"1651_CR61","doi-asserted-by":"crossref","unstructured":"Stroud, J., Ross, D., Sun, C., Deng, J., & Sukthankar, R. 
(2020). D3d: Distilled 3d networks for video action recognition. In Proceedings of the IEEE winter conference on applications of computer vision.","DOI":"10.1109\/WACV45572.2020.9093274"},{"key":"1651_CR62","doi-asserted-by":"crossref","unstructured":"Tan, H. K., Ngo, C. W., Hong, R., & Chua, T. S. (2009). Scalable detection of partial near-duplicate videos by visual-temporal consistency. In Proceedings of the ACM international conference on Multimedia.","DOI":"10.1145\/1631272.1631295"},{"key":"1651_CR63","doi-asserted-by":"crossref","unstructured":"Tavakolian, M., Tavakoli, H. R., & Hadid, A. (2019). AWSD: Adaptive weighted spatiotemporal distillation for video representation. In Proceedings of the IEEE international conference on computer vision.","DOI":"10.1109\/ICCV.2019.00811"},{"key":"1651_CR64","doi-asserted-by":"crossref","unstructured":"Thoker, F. M., & Gall, J. (2019). Cross-modal knowledge distillation for action recognition. In Proceedings of the IEEE international conference on image processing.","DOI":"10.1109\/ICIP.2019.8802909"},{"key":"1651_CR65","unstructured":"Tolias, G., Sicre, R., & J\u00e9gou, H. (2016). Particular object retrieval with integral max-pooling of cnn activations. In Proceedings of the international conference on learning representations."},{"key":"1651_CR66","unstructured":"Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & J\u00e9gou, H. (2020). Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877"},{"key":"1651_CR67","doi-asserted-by":"crossref","unstructured":"Tung, F., & Mori, G. (2019). Similarity-preserving knowledge distillation. In Proceedings of the IEEE international conference on computer vision.","DOI":"10.1109\/ICCV.2019.00145"},{"key":"1651_CR68","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. 
In Proceedings of the international conference on neural information processing systems"},{"key":"1651_CR69","doi-asserted-by":"crossref","unstructured":"Wang, L., Bao, Y., Li, H., Fan, X., & Luo, Z. (2017). Compact cnn based video representation for efficient video copy detection. In Proceedings of the international conference on multimedia modeling.","DOI":"10.1007\/978-3-319-51811-4_47"},{"key":"1651_CR70","doi-asserted-by":"crossref","unstructured":"Wang, K. H., Cheng, C. C., Chen, Y. L., Song, Y., & Lai, S. H. (2021). Attention-based deep metric learning for near-duplicate video retrieval. In Proceedings of the international conference on pattern recognition.","DOI":"10.1109\/ICPR48806.2021.9412710"},{"key":"1651_CR71","doi-asserted-by":"crossref","unstructured":"Wu, X., Hauptmann, A. G., & Ngo, C. W. (2007). Practical elimination of near-duplicates from web video search. In Proceedings of the ACM international conference on multimedia.","DOI":"10.1145\/1291233.1291280"},{"key":"1651_CR72","doi-asserted-by":"crossref","unstructured":"Xie, Q., Luong, M. T., Hovy, E., & Le, Q. V. (2020). Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE conference on computer vision and pattern recognition","DOI":"10.1109\/CVPR42600.2020.01070"},{"key":"1651_CR73","unstructured":"Yalniz, I. Z., J\u00e9gou, H., Chen, K., Paluri, M., & Mahajan, D. (2019). Billion-scale semi-supervised learning for image classification. arXiv preprint arXiv:1905.00546"},{"key":"1651_CR74","doi-asserted-by":"crossref","unstructured":"Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., & Hovy, E. (2016). Hierarchical attention networks for document classification. 
In Proceedings of the conference of the North American chapter of the association for computational linguistics: Human language technologies.","DOI":"10.18653\/v1\/N16-1174"},{"issue":"1","key":"1651_CR75","doi-asserted-by":"publisher","first-page":"311","DOI":"10.1007\/s11042-018-5862-3","volume":"78","author":"Y Yang","year":"2019","unstructured":"Yang, Y., Tian, Y., & Huang, T. (2019). Multiscale video sequence matching for near-duplicate detection and retrieval. Multimedia Tools and Applications, 78(1), 311\u2013336.","journal-title":"Multimedia Tools and Applications"},{"key":"1651_CR76","doi-asserted-by":"crossref","unstructured":"Yuan, L., Wang, T., Zhang, X., Tay, F. E., Jie, Z., Liu, W., & Feng, J. (2020). Central similarity quantization for efficient image and video retrieval. In Proceedings of the IEEE conference on computer vision and pattern recognition.","DOI":"10.1109\/CVPR42600.2020.00315"},{"key":"1651_CR77","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Shi, Y., Yuan, C., Li, B., Wang, P., Hu, W., & Zha, Z. J. (2020). Object relational graph with teacher-recommended learning for video captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition.","DOI":"10.1109\/CVPR42600.2020.01329"},{"key":"1651_CR78","doi-asserted-by":"crossref","unstructured":"Zhang, C., & Peng, Y. (2018). Better and faster: Knowledge transfer from multiple self-supervised learning tasks via graph distillation for video classification. In Proceedings of the international joint conference on artificial intelligence.","DOI":"10.24963\/ijcai.2018\/158"},{"key":"1651_CR79","doi-asserted-by":"crossref","unstructured":"Zhao, Z., Chen, G., Chen, C., Li, X., Xiang, X., Zhao, Y., & Su, F. (2019). Instance-based video search via multi-task retrieval and re-ranking. 
In Proceedings of the IEEE\/CVF international conference on computer vision workshops.","DOI":"10.1109\/ICCVW.2019.00234"}],"container-title":["International Journal of Computer Vision"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-022-01651-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11263-022-01651-3\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-022-01651-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,9,10]],"date-time":"2022-09-10T10:09:28Z","timestamp":1662804568000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11263-022-01651-3"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,8,5]]},"references-count":79,"journal-issue":{"issue":"10","published-print":{"date-parts":[[2022,10]]}},"alternative-id":["1651"],"URL":"https:\/\/doi.org\/10.1007\/s11263-022-01651-3","relation":{},"ISSN":["0920-5691","1573-1405"],"issn-type":[{"value":"0920-5691","type":"print"},{"value":"1573-1405","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,8,5]]},"assertion":[{"value":"24 June 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 July 2022","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"5 August 2022","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}