{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:08:52Z","timestamp":1750219732066,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":30,"publisher":"ACM","license":[{"start":{"date-parts":[[2023,10,29]],"date-time":"2023-10-29T00:00:00Z","timestamp":1698537600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,10,29]]},"DOI":"10.1145\/3607540.3617143","type":"proceedings-article","created":{"date-parts":[[2023,10,30]],"date-time":"2023-10-30T01:00:50Z","timestamp":1698627650000},"page":"25-29","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Sequential Action Retrieval for Generating Narratives from Long Videos"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4673-6924","authenticated-orcid":false,"given":"Satoshi","family":"Yamazaki","sequence":"first","affiliation":[{"name":"NEC Corporation, Tokyo, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4303-9020","authenticated-orcid":false,"given":"Jianquan","family":"Liu","sequence":"additional","affiliation":[{"name":"NEC Corporation, Tokyo, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4846-2015","authenticated-orcid":false,"given":"Mohan","family":"Kankanhalli","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2023,10,29]]},"reference":[{"key":"e_1_3_2_1_1_1","first-page":"23716","article-title":"Flamingo: a visual language 
model for few-shot learning","volume":"35","author":"Alayrac Jean-Baptiste","year":"2022","unstructured":"Jean-Baptiste Alayrac , Jeff Donahue , Pauline Luc , Antoine Miech , Iain Barr , Yana Hasson , Karel Lenc , Arthur Mensch , Katherine Millican , Malcolm Reynolds , 2022 . Flamingo: a visual language model for few-shot learning . Advances in Neural Information Processing Systems , Vol. 35 (2022), 23716 -- 23736 . Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems , Vol. 35 (2022), 23716--23736.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"crossref","unstructured":"Peter Anderson Xiaodong He Chris Buehler Damien Teney Mark Johnson Stephen Gould and Lei Zhang. 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In CVPR. 6077--6086.  Peter Anderson Xiaodong He Chris Buehler Damien Teney Mark Johnson Stephen Gould and Lei Zhang. 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In CVPR. 6077--6086.","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_3_2_1_3_1","volume-title":"VQA: Visual Question Answering. In ICCV. 2425--2433.","author":"Antol Stanislaw","year":"2015","unstructured":"Stanislaw Antol , Aishwarya Agrawal , Jiasen Lu , Margaret Mitchell , Dhruv Batra , C. Lawrence Zitnick , and Devi Parikh . 2015 . VQA: Visual Question Answering. In ICCV. 2425--2433. Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question Answering. In ICCV. 
2425--2433."},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01246-5_24"},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"crossref","first-page":"e01542","DOI":"10.1002\/bes2.1542","article-title":"Storytelling: A Natural Tool to Weave the Threads of Science and Community Together","volume":"100","author":"Bayer Skylar","year":"2019","unstructured":"Skylar Bayer and Annaliese Hettinger . 2019 . Storytelling: A Natural Tool to Weave the Threads of Science and Community Together . The Bulletin of the Ecological Society of America , Vol. 100 , 2 (2019), e01542 . Skylar Bayer and Annaliese Hettinger. 2019. Storytelling: A Natural Tool to Weave the Threads of Science and Community Together. The Bulletin of the Ecological Society of America, Vol. 100, 2 (2019), e01542.","journal-title":"The Bulletin of the Ecological Society of America"},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP.2016.7533003"},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"crossref","unstructured":"Jo\u00e3o Carreira and Andrew Zisserman. 2017. Quo Vadis Action Recognition? A New Model and the Kinetics Dataset. In CVPR. 4724--4733.  Jo\u00e3o Carreira and Andrew Zisserman. 2017. Quo Vadis Action Recognition? A New Model and the Kinetics Dataset. In CVPR. 4724--4733.","DOI":"10.1109\/CVPR.2017.502"},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"crossref","unstructured":"Christoph Feichtenhofer Haoqi Fan Jitendra Malik and Kaiming He. 2019. SlowFast Networks for Video Recognition. In ICCV. 6201--6210.  Christoph Feichtenhofer Haoqi Fan Jitendra Malik and Kaiming He. 2019. SlowFast Networks for Video Recognition. In ICCV. 
6201--6210.","DOI":"10.1109\/ICCV.2019.00630"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298676"},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00633"},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298928"},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01025"},{"key":"e_1_3_2_1_13_1","volume-title":"ICML (Proceedings of Machine Learning Research","volume":"5594","author":"Kim Wonjae","year":"2021","unstructured":"Wonjae Kim , Bokyung Son , and Ildoo Kim . 2021 . ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision . In ICML (Proceedings of Machine Learning Research , Vol. 139). PMLR, 5583-- 5594 . Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. In ICML (Proceedings of Machine Learning Research, Vol. 139). PMLR, 5583--5594."},{"key":"e_1_3_2_1_14_1","volume-title":"Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597","author":"Li Junnan","year":"2023","unstructured":"Junnan Li , Dongxu Li , Silvio Savarese , and Steven Hoi . 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 ( 2023 ). Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)."},{"key":"e_1_3_2_1_15_1","first-page":"17939","article-title":"MOMA: Multi-Object Multi-Actor Activity Parsing","volume":"34","author":"Luo Zelun","year":"2021","unstructured":"Zelun Luo , Wanze Xie , Siddharth Kapoor , Yiyun Liang , Michael Cooper , Juan Carlos Niebles , Ehsan Adeli , and Fei-Fei Li . 2021 . MOMA: Multi-Object Multi-Actor Activity Parsing . 
Advances in Neural Information Processing Systems , Vol. 34 (2021), 17939 -- 17955 . Zelun Luo, Wanze Xie, Siddharth Kapoor, Yiyun Liang, Michael Cooper, Juan Carlos Niebles, Ehsan Adeli, and Fei-Fei Li. 2021. MOMA: Multi-Object Multi-Actor Activity Parsing. Advances in Neural Information Processing Systems , Vol. 34 (2021), 17939--17955.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10599-4_35"},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01103"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-48881-3_2"},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.229"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46448-0_31"},{"key":"e_1_3_2_1_21_1","volume-title":"Proceedings of the IEEE conference on computer vision and pattern recognition. 5179--5187","author":"Song Yale","year":"2015","unstructured":"Yale Song , Jordi Vallmitjana , Amanda Stent , and Alejandro Jaimes . 2015 . Tvsum: Summarizing web videos using titles . In Proceedings of the IEEE conference on computer vision and pattern recognition. 5179--5187 . Yale Song, Jordi Vallmitjana, Amanda Stent, and Alejandro Jaimes. 2015. Tvsum: Summarizing web videos using titles. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5179--5187."},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3552463.3557021"},{"key":"e_1_3_2_1_23_1","first-page":"200","article-title":"Multimodal few-shot learning with frozen language models","volume":"34","author":"Tsimpoukelli Maria","year":"2021","unstructured":"Maria Tsimpoukelli , Jacob L Menick , Serkan Cabi , SM Eslami , Oriol Vinyals , and Felix Hill . 2021 . Multimodal few-shot learning with frozen language models . Advances in Neural Information Processing Systems , Vol. 
34 (2021), 200 -- 212 . Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. 2021. Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems , Vol. 34 (2021), 200--212.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"crossref","unstructured":"Yongkang Wong Shaojing Fan Yangyang Guo Ziwei Xu Karen Stephen Rishabh Sheoran Anusha Bhamidipati Vivek Barsopia Jianquan Liu and Mohan Kankanhalli. 2022. Compute to Tell the Tale: Goal-Driven Narrative Generation. In ACM Multimedia (to be published).  Yongkang Wong Shaojing Fan Yangyang Guo Ziwei Xu Karen Stephen Rishabh Sheoran Anusha Bhamidipati Vivek Barsopia Jianquan Liu and Mohan Kankanhalli. 2022. Compute to Tell the Tale: Goal-Driven Narrative Generation. In ACM Multimedia (to be published).","DOI":"10.1145\/3503161.3549202"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.330"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00611"},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.3390\/s19051005"},{"key":"e_1_3_2_1_28_1","volume-title":"multi-camera tracking by hierarchical clustering: Recent progress on dukemtmc project. arXiv preprint arXiv:1712.09531","author":"Zhang Zhimeng","year":"2017","unstructured":"Zhimeng Zhang , Jianan Wu , Xuan Zhang , and Chi Zhang . 2017. Multi-target , multi-camera tracking by hierarchical clustering: Recent progress on dukemtmc project. arXiv preprint arXiv:1712.09531 ( 2017 ). Zhimeng Zhang, Jianan Wu, Xuan Zhang, and Chi Zhang. 2017. Multi-target, multi-camera tracking by hierarchical clustering: Recent progress on dukemtmc project. arXiv preprint arXiv:1712.09531 (2017)."},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00380"},{"key":"e_1_3_2_1_30_1","volume-title":"Proceedings, Part IX. 
Springer, 350--368","author":"Zhou Xingyi","year":"2022","unstructured":"Xingyi Zhou , Rohit Girdhar , Armand Joulin , Philipp Kr\u00e4henb\u00fchl , and Ishan Misra . 2022 . Detecting twenty-thousand classes using image-level supervision. In Computer Vision--ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23--27, 2022 , Proceedings, Part IX. Springer, 350--368 . io Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Kr\u00e4henb\u00fchl, and Ishan Misra. 2022. Detecting twenty-thousand classes using image-level supervision. In Computer Vision--ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23--27, 2022, Proceedings, Part IX. Springer, 350--368. io"}],"event":{"name":"MM '23: The 31st ACM International Conference on Multimedia","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Ottawa ON Canada","acronym":"MM '23"},"container-title":["Proceedings of the 2nd Workshop on User-centric Narrative Summarization of Long Videos"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3607540.3617143","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3607540.3617143","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:36:28Z","timestamp":1750178188000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3607540.3617143"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10,29]]},"references-count":30,"alternative-id":["10.1145\/3607540.3617143","10.1145\/3607540"],"URL":"https:\/\/doi.org\/10.1145\/3607540.3617143","relation":{},"subject":[],"published":{"date-parts":[[2023,10,29]]},"assertion":[{"value":"2023-10-29","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}