{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:33:35Z","timestamp":1750221215842,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":52,"publisher":"ACM","license":[{"start":{"date-parts":[[2018,10,19]],"date-time":"2018-10-19T00:00:00Z","timestamp":1539907200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Science Foundation Award","award":["1722847"],"award-info":[{"award-number":["1722847"]}]},{"DOI":"10.13039\/501100011002","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61573045"],"award-info":[{"award-number":["61573045"]}],"id":[{"id":"10.13039\/501100011002","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2018,10,19]]},"DOI":"10.1145\/3265845.3265851","type":"proceedings-article","created":{"date-parts":[[2018,10,23]],"date-time":"2018-10-23T12:17:05Z","timestamp":1540297025000},"page":"77-85","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":9,"title":["Sports Video Captioning by Attentive Motion Representation based Hierarchical Recurrent Neural Networks"],"prefix":"10.1145","author":[{"given":"Mengshi","family":"Qi","sequence":"first","affiliation":[{"name":"Beihang University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yunhong","family":"Wang","sequence":"additional","affiliation":[{"name":"Beihang University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Annan","family":"Li","sequence":"additional","affiliation":[{"name":"Beihang University, Beijing, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jiebo","family":"Luo","sequence":"additional","affiliation":[{"name":"University of Rochester, Rochester, NY, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2018,10,19]]},"reference":[{"key":"e_1_3_2_1_1_1","first-page":"265","article-title":"TensorFlow: A System for Large-Scale Machine Learning","volume":"16","author":"Abadi Mart\u00edn","year":"2016","unstructured":"Mart\u00edn Abadi , Paul Barham , Jianmin Chen , Zhifeng Chen , Andy Davis , Jeffrey Dean , Matthieu Devin , Sanjay Ghemawat , Geoffrey Irving , Michael Isard , 2016 . TensorFlow: A System for Large-Scale Machine Learning . In OSDI , Vol. 16. 265 -- 283 . Mart\u00edn Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A System for Large-Scale Machine Learning. In OSDI, Vol. 16. 265--283.","journal-title":"OSDI"},{"volume-title":"Hierarchical boundary-aware neural encoder for video captioning","author":"Baraldi Lorenzo","key":"e_1_3_2_1_2_1","unstructured":"Lorenzo Baraldi , Costantino Grana , and Rita Cucchiara . 2017. Hierarchical boundary-aware neural encoder for video captioning . In CVPR. IEEE. Lorenzo Baraldi, Costantino Grana, and Rita Cucchiara. 2017. Hierarchical boundary-aware neural encoder for video captioning. In CVPR. IEEE."},{"volume-title":"CVPR","author":"Cao Zhe","key":"e_1_3_2_1_3_1","unstructured":"Zhe Cao , Tomas Simon , Shih-En Wei , and Yaser Sheikh . 2017. Realtime multi-person 2d pose estimation using part affinity fields . In CVPR . IEEE. Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2017. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR . 
IEEE."},{"key":"e_1_3_2_1_4_1","volume-title":"P-cnn: Pose-based cnn features for action recognition","author":"Ch\u00e9ron Guilhem","year":"2015","unstructured":"Guilhem Ch\u00e9ron , Ivan Laptev , and Cordelia Schmid . 2015 . P-cnn: Pose-based cnn features for action recognition . In ICCV. IEEE. Guilhem Ch\u00e9ron, Ivan Laptev, and Cordelia Schmid. 2015. P-cnn: Pose-based cnn features for action recognition. In ICCV. IEEE."},{"key":"e_1_3_2_1_5_1","volume-title":"Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint","author":"Chung Junyoung","year":"2014","unstructured":"Junyoung Chung , Caglar Gulcehre , KyungHyun Cho , and Yoshua Bengio . 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint ( 2014 ). Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint (2014)."},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/W14-3348"},{"key":"e_1_3_2_1_7_1","volume-title":"Improving interpretability of deep neural networks with semantic information. arXiv preprint","author":"Dong Yinpeng","year":"2017","unstructured":"Yinpeng Dong , Hang Su , Jun Zhu , and Bo Zhang . 2017. Improving interpretability of deep neural networks with semantic information. arXiv preprint ( 2017 ). Yinpeng Dong, Hang Su, Jun Zhu, and Bo Zhang. 2017. Improving interpretability of deep neural networks with semantic information. arXiv preprint (2017)."},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/957013.957020"},{"key":"e_1_3_2_1_9_1","volume-title":"Juan Carlos Niebles, and Bernard Ghanem.","author":"Escorcia Victor","year":"2016","unstructured":"Victor Escorcia , Fabian Caba Heilbron , Juan Carlos Niebles, and Bernard Ghanem. 2016 . Daps : Deep action proposals for action understanding. In ECCV. Springer . 
Victor Escorcia, Fabian Caba Heilbron, Juan Carlos Niebles, and Bernard Ghanem. 2016. Daps: Deep action proposals for action understanding. In ECCV. Springer."},{"volume-title":"Semantic compositional networks for visual captioning","author":"Gan Zhe","key":"e_1_3_2_1_10_1","unstructured":"Zhe Gan , Chuang Gan , Xiaodong He , Yunchen Pu , Kenneth Tran , Jianfeng Gao , Lawrence Carin , and Li Deng . 2017. Semantic compositional networks for visual captioning . In CVPR. IEEE. Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, and Li Deng. 2017. Semantic compositional networks for visual captioning. In CVPR. IEEE."},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.169"},{"volume-title":"Finding action tubes","author":"Gkioxari Georgia","key":"e_1_3_2_1_12_1","unstructured":"Georgia Gkioxari and Jitendra Malik . 2015. Finding action tubes . In CVPR. IEEE. Georgia Gkioxari and Jitendra Malik. 2015. Finding action tubes. In CVPR. IEEE."},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2013.337"},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_3_2_1_15_1","volume-title":"Attention-based multimodal fusion for video description. arXiv preprint","author":"Hori Chiori","year":"2017","unstructured":"Chiori Hori , Takaaki Hori , Teng-Yok Lee , Kazuhiro Sumi , John R Hershey , and Tim K Marks . 2017. Attention-based multimodal fusion for video description. arXiv preprint ( 2017 ). Chiori Hori, Takaaki Hori, Teng-Yok Lee, Kazuhiro Sumi, John R Hershey, and Tim K Marks. 2017. Attention-based multimodal fusion for video description. arXiv preprint (2017)."},{"volume-title":"A hierarchical deep temporal model for group activity recognition","author":"Ibrahim Mostafa S","key":"e_1_3_2_1_16_1","unstructured":"Mostafa S Ibrahim , Srikanth Muralidharan , Zhiwei Deng , Arash Vahdat , and Greg Mori . 2016. 
A hierarchical deep temporal model for group activity recognition . In CVPR. IEEE. Mostafa S Ibrahim, Srikanth Muralidharan, Zhiwei Deng, Arash Vahdat, and Greg Mori. 2016. A hierarchical deep temporal model for group activity recognition. In CVPR. IEEE."},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2014.223"},{"key":"e_1_3_2_1_18_1","volume-title":"Adam: A method for stochastic optimization. arXiv preprint","author":"Kingma Diederik P","year":"2014","unstructured":"Diederik P Kingma and Jimmy Ba . 2014 . Adam: A method for stochastic optimization. arXiv preprint (2014). Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint (2014)."},{"volume-title":"Dense-captioning events in videos","author":"Krishna Ranjay","key":"e_1_3_2_1_19_1","unstructured":"Ranjay Krishna , Kenji Hata , Frederic Ren , Li Fei-Fei , and Juan Carlos Niebles . 2017. Dense-captioning events in videos . In ICCV. IEEE. Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. 2017. Dense-captioning events in videos. In ICCV. IEEE."},{"key":"e_1_3_2_1_20_1","unstructured":"Alex Krizhevsky Ilya Sutskever and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In NIPS .   Alex Krizhevsky Ilya Sutskever and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In NIPS ."},{"key":"e_1_3_2_1_21_1","unstructured":"John Lafferty Andrew McCallum and Fernando CN Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. (2001).  John Lafferty Andrew McCallum and Fernando CN Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. 
(2001)."},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.3115\/1218955.1219032"},{"volume-title":"CVPR","author":"Lu Jiasen","key":"e_1_3_2_1_23_1","unstructured":"Jiasen Lu , Caiming Xiong , Devi Parikh , and Richard Socher . 2017. Knowing when to look: Adaptive attention via a visual sentinel for image captioning . In CVPR . IEEE. Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. 2017. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In CVPR . IEEE."},{"key":"e_1_3_2_1_24_1","unstructured":"Alejandro Newell Zhiao Huang and Jia Deng. 2017. Associative embedding: End-to-end learning for joint detection and grouping. In NIPS .  Alejandro Newell Zhiao Huang and Jia Deng. 2017. Associative embedding: End-to-end learning for joint detection and grouping. In NIPS ."},{"volume-title":"Stacked hourglass networks for human pose estimation","author":"Newell Alejandro","key":"e_1_3_2_1_25_1","unstructured":"Alejandro Newell , Kaiyu Yang , and Jia Deng . 2016. Stacked hourglass networks for human pose estimation . In ECCV. Springer . Alejandro Newell, Kaiyu Yang, and Jia Deng. 2016. Stacked hourglass networks for human pose estimation. In ECCV. Springer."},{"volume-title":"Hierarchical recurrent neural encoder for video representation with application to captioning","author":"Pan Pingbo","key":"e_1_3_2_1_26_1","unstructured":"Pingbo Pan , Zhongwen Xu , Yi Yang , Fei Wu , and Yueting Zhuang . 2016b. Hierarchical recurrent neural encoder for video representation with application to captioning . In CVPR. IEEE. Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, and Yueting Zhuang. 2016b. Hierarchical recurrent neural encoder for video representation with application to captioning. In CVPR. IEEE."},{"volume-title":"CVPR","author":"Pan Yingwei","key":"e_1_3_2_1_27_1","unstructured":"Yingwei Pan , Tao Mei , Ting Yao , Houqiang Li , and Yong Rui . 2016a. Jointly modeling embedding and translation to bridge video and language . 
In CVPR . IEEE. Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, and Yong Rui. 2016a. Jointly modeling embedding and translation to bridge video and language. In CVPR . IEEE."},{"volume-title":"Video captioning with transferred semantic attributes","author":"Pan Yingwei","key":"e_1_3_2_1_28_1","unstructured":"Yingwei Pan , Ting Yao , Houqiang Li , and Tao Mei . 2017. Video captioning with transferred semantic attributes . In CVPR. IEEE. Yingwei Pan, Ting Yao, Houqiang Li, and Tao Mei. 2017. Video captioning with transferred semantic attributes. In CVPR. IEEE."},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.3115\/1073083.1073135"},{"key":"e_1_3_2_1_30_1","volume-title":"Multi-task video captioning with video and entailment generation. arXiv preprint","author":"Pasunuru Ramakanth","year":"2017","unstructured":"Ramakanth Pasunuru and Mohit Bansal . 2017. Multi-task video captioning with video and entailment generation. arXiv preprint ( 2017 ). Ramakanth Pasunuru and Mohit Bansal. 2017. Multi-task video captioning with video and entailment generation. arXiv preprint (2017)."},{"volume-title":"stagNet: An Attentive Semantic RNN for Group Activity Recognition","author":"Qi Mengshi","key":"e_1_3_2_1_31_1","unstructured":"Mengshi Qi , Jie Qin , Annan Li , Yunhong Wang , Jiebo Luo , and Luc Van Gool . 2018. stagNet: An Attentive Semantic RNN for Group Activity Recognition . In ECCV. Springer . Mengshi Qi, Jie Qin, Annan Li, Yunhong Wang, Jiebo Luo, and Luc Van Gool. 2018. stagNet: An Attentive Semantic RNN for Group Activity Recognition. In ECCV. Springer."},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3123311"},{"volume-title":"Discovering discriminative action parts from mid-level video representations","author":"Raptis Michalis","key":"e_1_3_2_1_33_1","unstructured":"Michalis Raptis , Iasonas Kokkinos , and Stefano Soatto . 2012. Discovering discriminative action parts from mid-level video representations . In CVPR. IEEE. 
Michalis Raptis, Iasonas Kokkinos, and Stefano Soatto. 2012. Discovering discriminative action parts from mid-level video representations. In CVPR. IEEE."},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2013.61"},{"volume-title":"Weakly supervised dense video captioning","author":"Shen Zhiqiang","key":"e_1_3_2_1_35_1","unstructured":"Zhiqiang Shen , Jianguo Li , Zhou Su , Minjun Li , Yurong Chen , Yu-Gang Jiang , and Xiangyang Xue . 2017. Weakly supervised dense video captioning . In CVPR. IEEE. Zhiqiang Shen, Jianguo Li, Zhou Su, Minjun Li, Yurong Chen, Yu-Gang Jiang, and Xiangyang Xue. 2017. Weakly supervised dense video captioning. In CVPR. IEEE."},{"key":"e_1_3_2_1_36_1","volume-title":"Very deep convolutional networks for large-scale image recognition. arXiv preprint","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman . 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint ( 2014 ). Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint (2014)."},{"key":"e_1_3_2_1_37_1","volume-title":"Amir Roshan Zamir, and Mubarak Shah","author":"Soomro Khurram","year":"2012","unstructured":"Khurram Soomro , Amir Roshan Zamir, and Mubarak Shah . 2012 . UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint (2012). Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint (2012)."},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2013.335"},{"key":"e_1_3_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.510"},{"key":"e_1_3_2_1_40_1","volume-title":"Cider: Consensus-based image description evaluation","author":"Vedantam Ramakrishna","year":"2015","unstructured":"Ramakrishna Vedantam , C Lawrence Zitnick , and Devi Parikh . 2015 . 
Cider: Consensus-based image description evaluation . In CVPR. IEEE. Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. Cider: Consensus-based image description evaluation. In CVPR. IEEE."},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.515"},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2011.5995407"},{"volume-title":"CVPR","author":"Wang Limin","key":"e_1_3_2_1_43_1","unstructured":"Limin Wang , Yu Qiao , and Xiaoou Tang . 2015. Action recognition with trajectory-pooled deep-convolutional descriptors . In CVPR . IEEE. Limin Wang, Yu Qiao, and Xiaoou Tang. 2015. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR . IEEE."},{"volume-title":"Interpretable Video Captioning via Trajectory Structured Localization","author":"Wu Xian","key":"e_1_3_2_1_44_1","unstructured":"Xian Wu , Guanbin Li , Qingxing Cao , Qingge Ji , and Liang Lin . 2018. Interpretable Video Captioning via Trajectory Structured Localization . In CVPR. IEEE. Xian Wu, Guanbin Li, Qingxing Cao, Qingge Ji, and Liang Lin. 2018. Interpretable Video Captioning via Trajectory Structured Localization. In CVPR. IEEE."},{"key":"e_1_3_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/1180639.1180699"},{"key":"e_1_3_2_1_46_1","volume-title":"Msr-vtt: A large video description dataset for bridging video and language","author":"Xu Jun","year":"2016","unstructured":"Jun Xu , Tao Mei , Ting Yao , and Yong Rui . 2016 . Msr-vtt: A large video description dataset for bridging video and language . In CVPR. IEEE. Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. Msr-vtt: A large video description dataset for bridging video and language. In CVPR. IEEE."},{"key":"e_1_3_2_1_47_1","unstructured":"Kelvin Xu Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron Courville Ruslan Salakhudinov Rich Zemel and Yoshua Bengio. 2015a. Show attend and tell: Neural image caption generation with visual attention. In ICML .   
Kelvin Xu Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron Courville Ruslan Salakhudinov Rich Zemel and Yoshua Bengio. 2015a. Show attend and tell: Neural image caption generation with visual attention. In ICML ."},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"crossref","unstructured":"Ran Xu Caiming Xiong Wei Chen and Jason J Corso. 2015b. Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework.. In AAAI .   Ran Xu Caiming Xiong Wei Chen and Jason J Corso. 2015b. Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework.. In AAAI .","DOI":"10.1609\/aaai.v29i1.9512"},{"key":"e_1_3_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.512"},{"volume-title":"Image captioning with semantic attention","author":"You Quanzeng","key":"e_1_3_2_1_50_1","unstructured":"Quanzeng You , Hailin Jin , Zhaowen Wang , Chen Fang , and Jiebo Luo . 2016. Image captioning with semantic attention . In CVPR. IEEE. Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image captioning with semantic attention. In CVPR. IEEE."},{"volume-title":"CVPR","author":"Yu Haonan","key":"e_1_3_2_1_51_1","unstructured":"Haonan Yu , Jiang Wang , Zhiheng Huang , Yi Yang , and Wei Xu. 2016. Video paragraph captioning using hierarchical recurrent neural networks . In CVPR . IEEE. Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, and Wei Xu. 2016. Video paragraph captioning using hierarchical recurrent neural networks. In CVPR . 
IEEE."},{"key":"e_1_3_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/1291233.1291250"}],"event":{"name":"MM '18: ACM Multimedia Conference","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Seoul Republic of Korea","acronym":"MM '18"},"container-title":["Proceedings of the 1st International Workshop on Multimedia Content Analysis in Sports"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3265845.3265851","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3265845.3265851","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T01:39:50Z","timestamp":1750210790000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3265845.3265851"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,10,19]]},"references-count":52,"alternative-id":["10.1145\/3265845.3265851","10.1145\/3265845"],"URL":"https:\/\/doi.org\/10.1145\/3265845.3265851","relation":{},"subject":[],"published":{"date-parts":[[2018,10,19]]},"assertion":[{"value":"2018-10-19","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}