{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,12]],"date-time":"2026-07-12T02:32:05Z","timestamp":1783823525363,"version":"3.55.0"},"reference-count":41,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2024,3,26]],"date-time":"2024-03-26T00:00:00Z","timestamp":1711411200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,3,26]],"date-time":"2024-03-26T00:00:00Z","timestamp":1711411200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Hochschule f\u00fcr Technik und Wirtschaft Berlin"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Multimed Info Retr"],"published-print":{"date-parts":[[2024,6]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>\n                    CLIP-based text-to-image retrieval has proven to be very effective at the interactive video retrieval competition\n                    <jats:italic>Video Browser Showdown<\/jats:italic>\n                    2022, where all three top-scoring teams had implemented a variant of a CLIP model in their system. Since the performance of these three systems was quite close, this post-evaluation was designed to get better insights on the differences of the systems and compare the CLIP-based text-query retrieval engines by introducing slight modifications to the original competition settings. 
An extended analysis of the overall results and the retrieval performance of all systems\u2019 functionalities shows that a strong text retrieval model certainly helps, but has to be coupled with extensive browsing capabilities and other query-modalities to consistently solve known-item-search tasks in a large-scale video database.\n                  <\/jats:p>","DOI":"10.1007\/s13735-024-00325-9","type":"journal-article","created":{"date-parts":[[2024,3,26]],"date-time":"2024-03-26T12:02:19Z","timestamp":1711454539000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["Interactive multimodal video search: an extended post-evaluation for the VBS 2022 competition"],"prefix":"10.1007","volume":"13","author":[{"given":"Konstantin","family":"Schall","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Werner","family":"Bailer","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Kai-Uwe","family":"Barthel","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Fabio","family":"Carrara","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jakub","family":"Loko\u010d","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Ladislav","family":"Pe\u0161ka","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Klaus","family":"Schoeffmann","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Lucia","family":"Vadicamo","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Claudio","family":"Vairo","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-onl
ine":{"date-parts":[[2024,3,26]]},"reference":[{"key":"325_CR1","unstructured":"Radford A et\u00a0al (2021) Learning transferable visual models from natural language supervision. In: Meila M, Zhang T (eds.) Proceedings of the 38th international conference on machine learning, vol 139 of Proceedings of machine learning research, pp 8748\u20138763 (PMLR). https:\/\/proceedings.mlr.press\/v139\/radford21a.html"},{"key":"325_CR2","doi-asserted-by":"publisher","DOI":"10.1007\/s00530-023-01143-5","author":"J Loko\u010d","year":"2023","unstructured":"Loko\u010d J et al (2023) Interactive video retrieval in the age of effective joint embedding deep models: lessons from the 11th VBS. Multimed Syst. https:\/\/doi.org\/10.1007\/s00530-023-01143-5","journal-title":"Multimed Syst"},{"key":"325_CR3","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/s13735-021-00225-2","volume":"11","author":"S Heller","year":"2022","unstructured":"Heller S et al (2022) Interactive video retrieval evaluation at a distance: comparing sixteen interactive video search systems in a remote setting at the 10th video browser showdown. Int J Multimed Inf Retr 11:1\u201318. https:\/\/doi.org\/10.1007\/s13735-021-00225-2","journal-title":"Int J Multimed Inf Retr"},{"key":"325_CR4","doi-asserted-by":"publisher","DOI":"10.1145\/3445031","author":"J Loko\u010d","year":"2021","unstructured":"Loko\u010d J et al (2021) Is the reign of interactive search eternal? Findings from the video browser showdown 2020. ACM Trans Multimed Comput Commun Appl (TOMM). https:\/\/doi.org\/10.1145\/3445031","journal-title":"ACM Trans Multimed Comput Commun Appl (TOMM)"},{"key":"325_CR5","doi-asserted-by":"crossref","unstructured":"Gurrin C et\u00a0al (2023) Introduction to the sixth annual lifelog search challenge, LSC\u201923. In: Kompatsiaris IY, et\u00a0al (eds.) 
Proceedings international conference on multimedia retrieval (ICMR\u201923) (ACM, Thessaloniki, Greece)","DOI":"10.1145\/3591106.3592304"},{"key":"325_CR6","unstructured":"Awad G et\u00a0al (2022) An overview on the evaluated video retrieval tasks at trecvid 2022. In: Awad G (ed.) Proceedings of TRECVID 2022 (NIST, USA)"},{"key":"325_CR7","first-page":"1","volume":"12","author":"MG Constantin","year":"2020","unstructured":"Constantin MG, Hicks S, Larson M, Nguyen N-T (2020) MediaEval multimedia evaluation benchmark: tenth anniversary and counting. ACM SIGMM Rec 12:1\u20131","journal-title":"ACM SIGMM Rec"},{"key":"325_CR8","doi-asserted-by":"crossref","unstructured":"Loko\u010d J et\u00a0al (2022) A task category space for user-centric comparative multimedia search evaluations. In: \u00de\u00f3r J\u00f3nsson B, et\u00a0al (eds.) International conference on multimedia modeling","DOI":"10.1007\/978-3-030-98358-1_16"},{"key":"325_CR9","doi-asserted-by":"publisher","first-page":"3361","DOI":"10.1109\/TMM.2018.2830110","volume":"20","author":"J Loko\u010d","year":"2018","unstructured":"Loko\u010d J, Bailer W, Schoeffmann K, M\u00fcnzer B, Awad G (2018) On influential trends in interactive video retrieval: video browser showdown 2015\u20132017. IEEE Trans Multimed 20:3361\u20133376","journal-title":"IEEE Trans Multimed"},{"key":"325_CR10","doi-asserted-by":"publisher","unstructured":"Gurrin C et\u00a0al (2022) Introduction to the fifth annual lifelog search challenge, LSC\u201922. In: Oria V, et\u00a0al (eds.) ICMR\u201922: international conference on multimedia retrieval, Newark, June 27\u201330, 2022, pp 685\u2013687 (ACM). https:\/\/doi.org\/10.1145\/3512527.3531439","DOI":"10.1145\/3512527.3531439"},{"key":"325_CR11","doi-asserted-by":"publisher","first-page":"30982","DOI":"10.1109\/ACCESS.2023.3248284","volume":"11","author":"L Tran","year":"2023","unstructured":"Tran L et al (2023) Comparing interactive retrieval approaches at the lifelog search challenge 2021. 
IEEE Access 11:30982\u201330995. https:\/\/doi.org\/10.1109\/ACCESS.2023.3248284","journal-title":"IEEE Access"},{"key":"325_CR12","doi-asserted-by":"publisher","DOI":"10.1109\/MMUL.2021.3066779","author":"L Rossetto","year":"2021","unstructured":"Rossetto L et al (2021) On the user-centric comparative remote evaluation of interactive video search systems. IEEE MultiMed. https:\/\/doi.org\/10.1109\/MMUL.2021.3066779","journal-title":"IEEE MultiMed"},{"key":"325_CR13","doi-asserted-by":"crossref","unstructured":"Hezel N, Schall K, Jung K, Barthel KU (2022) Efficient search and browsing of large-scale video collections with vibro. In: \u00de\u00f3r J\u00f3nsson B, et\u00a0al (eds.) MultiMedia modeling. Springer, Cham, pp 487\u2013492","DOI":"10.1007\/978-3-030-98355-0_43"},{"key":"325_CR14","doi-asserted-by":"crossref","unstructured":"Loko\u010d J, Mejzl\u00edk F, Sou\u010dek T, Dokoupil P, Pe\u0161ka L (2022) Video search with context-aware ranker and relevance feedback. In: \u00de\u00f3r J\u00f3nsson, B. et\u00a0al (eds.) MultiMedia modeling. Springer Cham, pp 505\u2013510","DOI":"10.1007\/978-3-030-98355-0_46"},{"key":"325_CR15","doi-asserted-by":"crossref","unstructured":"Amato G et\u00a0al (2022) Visione at video browser showdown 2022. In: \u00de\u00f3r J\u00f3nsson B, et\u00a0al (eds.) MultiMedia modeling. Springer, Cham, pp 543\u2013548","DOI":"10.1007\/978-3-030-98355-0_52"},{"key":"325_CR16","doi-asserted-by":"crossref","unstructured":"He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR)","DOI":"10.1109\/CVPR.2016.90"},{"key":"325_CR17","doi-asserted-by":"crossref","unstructured":"Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2007) Object retrieval with large vocabularies and fast spatial matching. 
In: 2007 IEEE conference on computer vision and pattern recognition (IEEE Computer Society)","DOI":"10.1109\/CVPR.2007.383172"},{"key":"325_CR18","unstructured":"Dosovitskiy A et\u00a0al (2020) An image is worth $$16 \\times 16$$ words: transformers for image recognition at scale. In: CoRR"},{"key":"325_CR19","doi-asserted-by":"crossref","unstructured":"Messina N, Falchi F, Esuli A, Amato G (2021) Transformer reasoning network for image\u2013text matching and retrieval. In: 2020 25th International conference on pattern recognition (ICPR), pp 5222\u20135229 (IEEE)","DOI":"10.1109\/ICPR48806.2021.9413172"},{"key":"325_CR20","unstructured":"Fang H, Xiong P, Xu L, Chen Y (2021) Clip2video: Mastering video-text retrieval via image clip. arXiv preprint arXiv:2106.11097"},{"key":"325_CR21","doi-asserted-by":"crossref","unstructured":"Liu Z et\u00a0al (2021) Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"325_CR22","doi-asserted-by":"publisher","first-page":"211","DOI":"10.1007\/s11263-015-0816-y","volume":"115","author":"O Russakovsky","year":"2015","unstructured":"Russakovsky O et al (2015) ImageNet large scale visual recognition challenge. Int J Comput Vis 115:211","journal-title":"Int J Comput Vis"},{"key":"325_CR23","doi-asserted-by":"crossref","unstructured":"Kim S, Kim D, Cho M, Kwak S (2020) Proxy anchor loss for deep metric learning. In: IEEE\/CVF conference on computer vision and pattern recognition (CVPR)","DOI":"10.1109\/CVPR42600.2020.00330"},{"key":"325_CR24","doi-asserted-by":"publisher","unstructured":"Cox I, Miller M, Omohundro S, Yianilos P (1996) Pichunter: Bayesian relevance feedback for image retrieval. In: International conference on pattern recognition, vol\u00a03, pp 361\u2013369 (IEEE). 
https:\/\/doi.org\/10.1109\/ICPR.1996.546971","DOI":"10.1109\/ICPR.1996.546971"},{"key":"325_CR25","doi-asserted-by":"publisher","unstructured":"Lokoc J, Peska L (2023) A study of a cross-modal interactive search tool using CLIP and temporal fusion. Dang-Nguyen D et\u00a0al (eds.) MultiMedia modeling\u201429th international conference, MMM 2023, Bergen, Norway, January 9\u201312, 2023, Proceedings, Part I, Vol. 13833 of Lecture Notes in Computer Science. Springer, pp 397\u2013408. https:\/\/doi.org\/10.1007\/978-3-031-27077-2_31","DOI":"10.1007\/978-3-031-27077-2_31"},{"key":"325_CR26","doi-asserted-by":"publisher","unstructured":"Revaud J, Almazan J, Rezende R, de\u00a0Souza C (2019) Learning with average precision: training image retrieval with a listwise loss. In: International conference on computer vision, pp 5106\u20135115 (IEEE). https:\/\/doi.org\/10.1109\/ICCV.2019.00521","DOI":"10.1109\/ICCV.2019.00521"},{"key":"325_CR27","doi-asserted-by":"crossref","unstructured":"Zhang H, Wang Y, Dayoub F, Sunderhauf N (2021) VarifocalNet: an IoU-aware dense object detector. In: 2021 IEEE\/CVF conference on computer vision and pattern recognition (CVPR) (IEEE)","DOI":"10.1109\/CVPR46437.2021.00841"},{"key":"325_CR28","doi-asserted-by":"crossref","unstructured":"He K, Gkioxari G, Doll\u00e1r P, Girshick R (2017) Mask R-CNN. In: Proceedings of the IEEE international conference on computer vision, pp 2961\u20132969","DOI":"10.1109\/ICCV.2017.322"},{"key":"325_CR29","doi-asserted-by":"crossref","unstructured":"Girshick R (2015) Fast R-CNN. In: Proceedings of the IEEE international conference on computer vision, pp 1440\u20131448","DOI":"10.1109\/ICCV.2015.169"},{"key":"325_CR30","doi-asserted-by":"publisher","first-page":"1512","DOI":"10.1109\/TIP.2009.2019809","volume":"18","author":"J Van De Weijer","year":"2009","unstructured":"Van De Weijer J, Schmid C, Verbeek J, Larlus D (2009) Learning color names for real-world applications. 
IEEE Trans Image Process 18:1512\u20131523. https:\/\/doi.org\/10.1109\/TIP.2009.2019809","journal-title":"IEEE Trans Image Process"},{"key":"325_CR31","doi-asserted-by":"publisher","first-page":"2582","DOI":"10.1364\/JOSAA.25.002582","volume":"25","author":"R Benavente","year":"2008","unstructured":"Benavente R, Vanrell M, Baldrich R (2008) Parametric fuzzy sets for automatic color naming. JOSA A 25:2582\u20132593. https:\/\/doi.org\/10.1364\/JOSAA.25.002582","journal-title":"JOSA A"},{"key":"325_CR32","doi-asserted-by":"publisher","first-page":"59","DOI":"10.1007\/BF00337288","volume":"43","author":"T Kohonen","year":"1982","unstructured":"Kohonen T (1982) Self-organized formation of topologically correct feature maps. Biol Cybern 43:59\u201369","journal-title":"Biol Cybern"},{"key":"325_CR33","doi-asserted-by":"crossref","unstructured":"Barthel KU, Hezel N, Jung K, Schall K (2023) Improved evaluation and generation of grid layouts using distance preservation quality and linear assignment sorting. In: Computer graphics forum","DOI":"10.1111\/cgf.14718"},{"key":"325_CR34","doi-asserted-by":"publisher","unstructured":"Ma Y et\u00a0al (2022) X-clip: end-to-end multi-grained contrastive learning for video-text retrieval, pp 638-647. https:\/\/doi.org\/10.1145\/3503161.3547910","DOI":"10.1145\/3503161.3547910"},{"key":"325_CR35","unstructured":"Bain M, Nagrani A, Varol G, Zisserman A (2022) A clip-hitchhiker\u2019s guide to long video retrieval. arXiv:2205.08508"},{"key":"325_CR36","doi-asserted-by":"crossref","unstructured":"Ali A, Schwartz I, Hazan T, Wolf L (2022) Video and text matching with conditioned embeddings, pp 1565\u20131574","DOI":"10.1109\/WACV51458.2022.00055"},{"key":"325_CR37","unstructured":"Vaswani A et\u00a0al (2017) Attention is all you need. In: Guyon I et\u00a0al (eds.) Advances in neural information processing systems, vol\u00a030. Curran Associates, Inc. 
https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2017\/file\/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf"},{"key":"325_CR38","doi-asserted-by":"publisher","unstructured":"Rossetto L, Gasser R, Sauter L, Bernstein A, Schuldt H, Lokoc J et\u00a0al (2021) A system for interactive multimedia retrieval evaluations. In: Lokoc J et\u00a0al (eds.) International conference on multimedia modeling. Springer. https:\/\/doi.org\/10.1007\/978-3-030-67835-7_33","DOI":"10.1007\/978-3-030-67835-7_33"},{"key":"325_CR39","doi-asserted-by":"publisher","unstructured":"Rossetto L, Schuldt H, Awad G, Butt AA, Kompatsiaris I et\u00a0al (2019) V3C\u2014a research video collection. Kompatsiaris I, et\u00a0al (eds.) International conference on multimedia modeling. Springer, pp 349\u2013360. https:\/\/doi.org\/10.1007\/978-3-030-05710-7_29","DOI":"10.1007\/978-3-030-05710-7_29"},{"key":"325_CR40","doi-asserted-by":"publisher","unstructured":"Loko\u010d J et\u00a0al (2019) Interactive search or sequential browsing? A detailed analysis of the video browser showdown 2018. In: ACM transactions on multimedia computing, communications, and applications, vol 15. https:\/\/doi.org\/10.1145\/3295663","DOI":"10.1145\/3295663"},{"key":"325_CR41","volume-title":"Natural language processing with Python: analyzing text with the natural language toolkit","author":"S Bird","year":"2009","unstructured":"Bird S, Klein E, Loper E (2009) Natural language processing with Python: analyzing text with the natural language toolkit. 
O\u2019Reilly Media, Inc., Sebastopol"}],"container-title":["International Journal of Multimedia Information Retrieval"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s13735-024-00325-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s13735-024-00325-9\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s13735-024-00325-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,5,22]],"date-time":"2024-05-22T03:03:55Z","timestamp":1716347035000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s13735-024-00325-9"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,3,26]]},"references-count":41,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2024,6]]}},"alternative-id":["325"],"URL":"https:\/\/doi.org\/10.1007\/s13735-024-00325-9","relation":{"has-preprint":[{"id-type":"doi","id":"10.21203\/rs.3.rs-3328018\/v1","asserted-by":"object"}]},"ISSN":["2192-6611","2192-662X"],"issn-type":[{"value":"2192-6611","type":"print"},{"value":"2192-662X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,3,26]]},"assertion":[{"value":"5 September 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"2 February 2024","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"19 February 2024","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"26 March 2024","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article 
History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no competing interests. The authors have no competing interests to declare that are relevant to the content of this article. All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript. The authors have no financial or proprietary interests in any material discussed in this article. The raw log data analyzed during the current study are available in this repository:\n                      \n                      . The repository also includes scripts to generate the presented results, tables and figures. We would also like to thank all colleagues and students who took the time to participate in this evaluation.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}],"article-number":"15"}}