{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:21:20Z","timestamp":1750220480640,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":33,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,10,18]],"date-time":"2021-10-18T00:00:00Z","timestamp":1634515200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,10,18]]},"DOI":"10.1145\/3462244.3479913","type":"proceedings-article","created":{"date-parts":[[2021,10,15]],"date-time":"2021-10-15T15:01:58Z","timestamp":1634310118000},"page":"595-603","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Cross Lingual Video and Text Retrieval: A New Benchmark Dataset and Algorithm"],"prefix":"10.1145","author":[{"given":"Jayaprakash","family":"Akula","sequence":"first","affiliation":[{"name":"Indian Institute Of Technology Bombay, India"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"family":"Abhishek","sequence":"additional","affiliation":[{"name":"Indian Institute of Technology, Bombay, India"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Rishabh","family":"Dabral","sequence":"additional","affiliation":[{"name":"IIT Bombay, India"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Preethi","family":"Jyothi","sequence":"additional","affiliation":[{"name":"IIT Bombay, India"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ganesh","family":"Ramakrishnan","sequence":"additional","affiliation":[{"name":"IIT Bombay, India"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2021,10,18]]},"reference":[{"doi-asserted-by":"crossref","unstructured":"Relja Arandjelovic Petr Gronat Akihiko Torii Tomas Pajdla and Josef Sivic. 2016. NetVLAD: CNN architecture for weakly supervised place recognition. In CVPR.  Relja Arandjelovic Petr Gronat Akihiko Torii Tomas Pajdla and Josef Sivic. 2016. NetVLAD: CNN architecture for weakly supervised place recognition. In CVPR.","key":"e_1_3_2_1_1_1","DOI":"10.1109\/CVPR.2016.572"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_2_1","DOI":"10.1109\/ICASSP40776.2020.9052974"},{"doi-asserted-by":"crossref","unstructured":"Joao Carreira and Andrew Zisserman. 2017. Quo Vadis Action Recognition? A New Model and the Kinetics Dataset. In CVPR.  Joao Carreira and Andrew Zisserman. 2017. Quo Vadis Action Recognition? A New Model and the Kinetics Dataset. In CVPR.","key":"e_1_3_2_1_3_1","DOI":"10.1109\/CVPR.2017.502"},{"doi-asserted-by":"crossref","unstructured":"Gal Chechik Varun Sharma Uri Shalit and Samy Bengio. 2010. Large Scale Online Learning of Image Similarity Through Ranking.Journal of Machine Learning Research(2010).  Gal Chechik Varun Sharma Uri Shalit and Samy Bengio. 2010. Large Scale Online Learning of Image Similarity Through Ranking.Journal of Machine Learning Research(2010).","key":"e_1_3_2_1_4_1","DOI":"10.1007\/978-3-642-02172-5_2"},{"key":"e_1_3_2_1_5_1","volume-title":"\u00a0M. Snoek","author":"Dong Jianfeng","year":"2016","unstructured":"Jianfeng Dong , Xirong Li , and Cees G . \u00a0M. Snoek . 2016 . Word2VisualVec: Cross-Media Retrieval by Visual Feature Prediction. ArXiv abs\/1604.06838(2016). Jianfeng Dong, Xirong Li, and Cees G.\u00a0M. Snoek. 2016. Word2VisualVec: Cross-Media Retrieval by Visual Feature Prediction. ArXiv abs\/1604.06838(2016)."},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_6_1","DOI":"10.1109\/TMM.2018.2832602"},{"key":"e_1_3_2_1_7_1","volume-title":"Article arXiv:1809.06181 (Sept.","author":"Dong Jianfeng","year":"2018","unstructured":"Jianfeng Dong , Xirong Li , Chaoxi Xu , Shouling Ji , Yuan He , Gang Yang , and Xun Wang . 2018. Dual Encoding for Zero-Example Video Retrieval. arXiv e-prints , Article arXiv:1809.06181 (Sept. 2018 ), arXiv:1809.06181\u00a0pages. arxiv:1809.06181\u00a0[cs.CV] Jianfeng Dong, Xirong Li, Chaoxi Xu, Shouling Ji, Yuan He, Gang Yang, and Xun Wang. 2018. Dual Encoding for Zero-Example Video Retrieval. arXiv e-prints, Article arXiv:1809.06181 (Sept. 2018), arXiv:1809.06181\u00a0pages. arxiv:1809.06181\u00a0[cs.CV]"},{"unstructured":"Andrea Frome Greg\u00a0S Corrado Jon Shlens Samy Bengio Jeff Dean Marc'\u00a0Aurelio Ranzato and Tomas Mikolov. 2013. DeViSE: A Deep Visual-Semantic Embedding Model. In NeurIPS.  Andrea Frome Greg\u00a0S Corrado Jon Shlens Samy Bengio Jeff Dean Marc'\u00a0Aurelio Ranzato and Tomas Mikolov. 2013. DeViSE: A Deep Visual-Semantic Embedding Model. In NeurIPS.","key":"e_1_3_2_1_8_1"},{"key":"e_1_3_2_1_9_1","volume-title":"Proceedings of the 29th Conference on Neural Information Processing Systems (NeurIPS).","author":"Gao Haoyuan","year":"2015","unstructured":"Haoyuan Gao , Junhua Mao , Jie Zhou , Zhiheng Huang , Lei Wang , and Wei Xu . 2015 . Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering . In Proceedings of the 29th Conference on Neural Information Processing Systems (NeurIPS). Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. 2015. Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering. In Proceedings of the 29th Conference on Neural Information Processing Systems (NeurIPS)."},{"key":"e_1_3_2_1_10_1","volume-title":"Tall: Temporal activity localization via language query. In ICCV.","author":"Gao Jiyang","year":"2017","unstructured":"Jiyang Gao , Chen Sun , Zhenheng Yang , and Ram Nevatia . 2017 . Tall: Temporal activity localization via language query. In ICCV. Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. 2017. Tall: Temporal activity localization via language query. In ICCV."},{"doi-asserted-by":"crossref","unstructured":"Lisa\u00a0Anne Hendricks Oliver Wang Eli Shechtman Josef Sivic Trevor Darrell and Bryan Russell. 2018. Localizing Moments in Video with Temporal Language. In EMNLP.  Lisa\u00a0Anne Hendricks Oliver Wang Eli Shechtman Josef Sivic Trevor Darrell and Bryan Russell. 2018. Localizing Moments in Video with Temporal Language. In EMNLP.","key":"e_1_3_2_1_11_1","DOI":"10.18653\/v1\/D18-1168"},{"doi-asserted-by":"crossref","unstructured":"Shawn Hershey Sourish Chaudhuri Daniel\u00a0PW Ellis Jort\u00a0F Gemmeke Aren Jansen R\u00a0Channing Moore Manoj Plakal Devin Platt Rif\u00a0A Saurous Bryan Seybold 2017. CNN architectures for large-scale audio classification. In ICASSP.  Shawn Hershey Sourish Chaudhuri Daniel\u00a0PW Ellis Jort\u00a0F Gemmeke Aren Jansen R\u00a0Channing Moore Manoj Plakal Devin Platt Rif\u00a0A Saurous Bryan Seybold 2017. CNN architectures for large-scale audio classification. In ICASSP.","key":"e_1_3_2_1_12_1","DOI":"10.1109\/ICASSP.2017.7952132"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_13_1","DOI":"10.1109\/TPAMI.2019.2913372"},{"unstructured":"Max Jaderberg Karen Simonyan Andrea Vedaldi and Andrew Zisserman. 2014. Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227(2014).  Max Jaderberg Karen Simonyan Andrea Vedaldi and Andrew Zisserman. 2014. Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227(2014).","key":"e_1_3_2_1_14_1"},{"doi-asserted-by":"crossref","unstructured":"K. Karaman E. Gundogdu A. Ko\u00e7 and A.\u00a0A. Alatan. 2019. Quadruplet Selection Methods for Deep Embedding Learning. In ICIP.  K. Karaman E. Gundogdu A. Ko\u00e7 and A.\u00a0A. Alatan. 2019. Quadruplet Selection Methods for Deep Embedding Learning. In ICIP.","key":"e_1_3_2_1_15_1","DOI":"10.1109\/ICIP.2019.8803401"},{"key":"e_1_3_2_1_16_1","volume-title":"Article arXiv:1411.2539 (Nov.","author":"Kiros Ryan","year":"2014","unstructured":"Ryan Kiros , Ruslan Salakhutdinov , and Richard\u00a0 S. Zemel . 2014. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. arXiv e-prints , Article arXiv:1411.2539 (Nov. 2014 ), arXiv:1411.2539\u00a0pages. arxiv:1411.2539\u00a0[cs.LG] Ryan Kiros, Ruslan Salakhutdinov, and Richard\u00a0S. Zemel. 2014. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. arXiv e-prints, Article arXiv:1411.2539 (Nov. 2014), arXiv:1411.2539\u00a0pages. arxiv:1411.2539\u00a0[cs.LG]"},{"unstructured":"Y. Liu S. Albanie A. Nagrani and A. Zisserman. 2019. Use What You Have: Video retrieval using representations from collaborative experts. In BMVC.  Y. Liu S. Albanie A. Nagrani and A. Zisserman. 2019. Use What You Have: Video retrieval using representations from collaborative experts. In BMVC.","key":"e_1_3_2_1_17_1"},{"unstructured":"Andrew\u00a0Silva Meera\u00a0Hahn and James\u00a0M. Rehg. 2019. Action2Vec: A Crossmodal Embedding Approach to Action Learning. In CVPR-W.  Andrew\u00a0Silva Meera\u00a0Hahn and James\u00a0M. Rehg. 2019. Action2Vec: A Crossmodal Embedding Approach to Action Learning. In CVPR-W.","key":"e_1_3_2_1_18_1"},{"key":"e_1_3_2_1_19_1","volume-title":"Diane\u00a0Larlusand Dima Damen","author":"Michael\u00a0Wray Gabriela\u00a0Csurka","year":"2019","unstructured":"Gabriela\u00a0Csurka Michael\u00a0Wray , Diane\u00a0Larlusand Dima Damen . 2019 . Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings. In CVPR. 450\u2013459. Gabriela\u00a0Csurka Michael\u00a0Wray, Diane\u00a0Larlusand Dima Damen. 2019. Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings. In CVPR. 450\u2013459."},{"unstructured":"Antoine Miech Ivan Laptev and Josef Sivic. 2019. Learning a Text-Video Embedding from Incomplete and Heterogeneous Data. In ICCV.  Antoine Miech Ivan Laptev and Josef Sivic. 2019. Learning a Text-Video Embedding from Incomplete and Heterogeneous Data. In ICCV.","key":"e_1_3_2_1_20_1"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_21_1","DOI":"10.1007\/s13735-018-00166-3"},{"unstructured":"Ramon Sanabria Ozan Caglayan Shruti Palaskar Desmond Elliott Lo\u00efc Barrault Lucia Specia and Florian Metze. 2018. How2: A Large-scale Dataset for Multimodal Language Understanding. CoRR abs\/1811.00347(2018). arxiv:1811.00347http:\/\/arxiv.org\/abs\/1811.00347  Ramon Sanabria Ozan Caglayan Shruti Palaskar Desmond Elliott Lo\u00efc Barrault Lucia Specia and Florian Metze. 2018. How2: A Large-scale Dataset for Multimodal Language Understanding. CoRR abs\/1811.00347(2018). arxiv:1811.00347http:\/\/arxiv.org\/abs\/1811.00347","key":"e_1_3_2_1_22_1"},{"doi-asserted-by":"crossref","unstructured":"F. Schroff D. Kalenichenko and J. Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In CVPR.  F. Schroff D. Kalenichenko and J. Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In CVPR.","key":"e_1_3_2_1_23_1","DOI":"10.1109\/CVPR.2015.7298682"},{"key":"e_1_3_2_1_24_1","volume-title":"Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics","author":"Socher Richard","year":"2014","unstructured":"Richard Socher , Andrej Karpathy , Quoc\u00a0 V Le , Christopher\u00a0 D Manning , and Andrew\u00a0 Y Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics ( 2014 ). Richard Socher, Andrej Karpathy, Quoc\u00a0V Le, Christopher\u00a0D Manning, and Andrew\u00a0Y Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics (2014)."},{"doi-asserted-by":"crossref","unstructured":"D. Tran H. Wang L. Torresani J. Ray Y. LeCun and M. Paluri. 2018. A Closer Look at Spatiotemporal Convolutions for Action Recognition. In CVPR. 6450\u20136459.  D. Tran H. Wang L. Torresani J. Ray Y. LeCun and M. Paluri. 2018. A Closer Look at Spatiotemporal Convolutions for Action Recognition. In CVPR. 6450\u20136459.","key":"e_1_3_2_1_25_1","DOI":"10.1109\/CVPR.2018.00675"},{"key":"e_1_3_2_1_26_1","volume-title":"Article arXiv:1511.06361 (Nov.","author":"Vendrov Ivan","year":"2015","unstructured":"Ivan Vendrov , Ryan Kiros , Sanja Fidler , and Raquel Urtasun . 2015. Order-Embeddings of Images and Language. arXiv e-prints , Article arXiv:1511.06361 (Nov. 2015 ), arXiv:1511.06361\u00a0pages. arxiv:1511.06361\u00a0[cs.LG] Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. 2015. Order-Embeddings of Images and Language. arXiv e-prints, Article arXiv:1511.06361 (Nov. 2015), arXiv:1511.06361\u00a0pages. arxiv:1511.06361\u00a0[cs.LG]"},{"doi-asserted-by":"crossref","unstructured":"Subhashini Venugopalan Huijuan Xu Jeff Donahue Marcus Rohrbach Raymond Mooney and Kate Saenko. 2015. Translating Videos to Natural Language Using Deep Recurrent Neural Networks. In NAACL-HLT.  Subhashini Venugopalan Huijuan Xu Jeff Donahue Marcus Rohrbach Raymond Mooney and Kate Saenko. 2015. Translating Videos to Natural Language Using Deep Recurrent Neural Networks. In NAACL-HLT.","key":"e_1_3_2_1_27_1","DOI":"10.3115\/v1\/N15-1173"},{"key":"e_1_3_2_1_28_1","volume-title":"VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. CoRR abs\/1904.03493(2019). arxiv:1904.03493http:\/\/arxiv.org\/abs\/1904.03493","author":"Wang Xin","year":"2019","unstructured":"Xin Wang , Jiawei Wu , Junkun Chen , Lei Li , Yuan-Fang Wang , and William\u00a0Yang Wang . 2019 . VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. CoRR abs\/1904.03493(2019). arxiv:1904.03493http:\/\/arxiv.org\/abs\/1904.03493 Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William\u00a0Yang Wang. 2019. VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. CoRR abs\/1904.03493(2019). arxiv:1904.03493http:\/\/arxiv.org\/abs\/1904.03493"},{"unstructured":"Chao-Yuan Wu R. Manmatha Alexander\u00a0J. Smola and Philipp Krahenbuhl. 2017. Sampling Matters in Deep Embedding Learning. In ICCV.  Chao-Yuan Wu R. Manmatha Alexander\u00a0J. Smola and Philipp Krahenbuhl. 2017. Sampling Matters in Deep Embedding Learning. In ICCV.","key":"e_1_3_2_1_29_1"},{"volume-title":"Aggregated Residual Transformations for Deep Neural Networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 5987\u20135995","author":"Xie S.","unstructured":"S. Xie , R. Girshick , P. Doll\u00e1r , Z. Tu , and K. He . 2017 . Aggregated Residual Transformations for Deep Neural Networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 5987\u20135995 . S. Xie, R. Girshick, P. Doll\u00e1r, Z. Tu, and K. He. 2017. Aggregated Residual Transformations for Deep Neural Networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 5987\u20135995.","key":"e_1_3_2_1_30_1"},{"doi-asserted-by":"crossref","unstructured":"Jun Xu Tao Mei Ting Yao and Yong Rui. 2016. MSR-VTT: A large video description dataset for bridging video and language. In CVPR.  Jun Xu Tao Mei Ting Yao and Yong Rui. 2016. MSR-VTT: A large video description dataset for bridging video and language. In CVPR.","key":"e_1_3_2_1_31_1","DOI":"10.1109\/CVPR.2016.571"},{"doi-asserted-by":"crossref","unstructured":"Lin Xu Han Sun and Yuai Liu. 2019. Learning with batch-wise optimal transport loss for 3d shape recognition. In CVPR.  Lin Xu Han Sun and Yuai Liu. 2019. Learning with batch-wise optimal transport loss for 3d shape recognition. In CVPR.","key":"e_1_3_2_1_32_1","DOI":"10.1109\/CVPR.2019.00345"},{"doi-asserted-by":"crossref","unstructured":"Bowen Zhang Hexiang Hu and Fei Sha. 2018. Cross-Modal and Hierarchical Modeling of Video and Text. In ECCV.  Bowen Zhang Hexiang Hu and Fei Sha. 2018. Cross-Modal and Hierarchical Modeling of Video and Text. In ECCV.","key":"e_1_3_2_1_33_1","DOI":"10.1007\/978-3-030-01261-8_23"}],"event":{"sponsor":["SIGCHI ACM Special Interest Group on Computer-Human Interaction"],"acronym":"ICMI '21","name":"ICMI '21: INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION","location":"Montr\u00e9al QC Canada"},"container-title":["Proceedings of the 2021 International Conference on Multimodal Interaction"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3462244.3479913","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3462244.3479913","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:48:54Z","timestamp":1750193334000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3462244.3479913"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,10,18]]},"references-count":33,"alternative-id":["10.1145\/3462244.3479913","10.1145\/3462244"],"URL":"https:\/\/doi.org\/10.1145\/3462244.3479913","relation":{},"subject":[],"published":{"date-parts":[[2021,10,18]]},"assertion":[{"value":"2021-10-18","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}