{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,20]],"date-time":"2025-10-20T10:27:44Z","timestamp":1760956064819,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":73,"publisher":"ACM","license":[{"start":{"date-parts":[[2020,10,12]],"date-time":"2020-10-12T00:00:00Z","timestamp":1602460800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2020,10,12]]},"DOI":"10.1145\/3394171.3413860","type":"proceedings-article","created":{"date-parts":[[2020,10,12]],"date-time":"2020-10-12T13:10:44Z","timestamp":1602508244000},"page":"4004-4012","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":22,"title":["Deep Concept-wise Temporal Convolutional Networks for Action Localization"],"prefix":"10.1145","author":[{"given":"Xin","family":"Li","sequence":"first","affiliation":[{"name":"Baidu Inc., Beijing, China"}]},{"given":"Tianwei","family":"Lin","sequence":"additional","affiliation":[{"name":"Baidu Inc., Shanghai, China"}]},{"given":"Xiao","family":"Liu","sequence":"additional","affiliation":[{"name":"Baidu Inc., Beijing, China"}]},{"given":"Wangmeng","family":"Zuo","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology, Harbin, China"}]},{"given":"Chao","family":"Li","sequence":"additional","affiliation":[{"name":"Baidu Inc., Beijing, China"}]},{"given":"Xiang","family":"Long","sequence":"additional","affiliation":[{"name":"Baidu Inc., Beijing, China"}]},{"given":"Dongliang","family":"He","sequence":"additional","affiliation":[{"name":"Baidu Inc., Beijing, China"}]},{"given":"Fu","family":"Li","sequence":"additional","affiliation":[{"name":"Baidu Inc., Beijing, 
China"}]},{"given":"Shilei","family":"Wen","sequence":"additional","affiliation":[{"name":"Baidu Inc., Beijing, China"}]},{"given":"Chuang","family":"Gan","sequence":"additional","affiliation":[{"name":"MIT-Watson AI Lab., Boston, MA, USA"}]}],"member":"320","published-online":{"date-parts":[[2020,10,12]]},"reference":[{"key":"e_1_3_2_2_1_1","doi-asserted-by":"crossref","unstructured":"Relja Arandjelovic Petr Gronat Akihiko Torii Tomas Pajdla and Josef Sivic. 2016. NetVLAD: CNN architecture for weakly supervised place recognition. In CVPR.  Relja Arandjelovic Petr Gronat Akihiko Torii Tomas Pajdla and Josef Sivic. 2016. NetVLAD: CNN architecture for weakly supervised place recognition. In CVPR.","DOI":"10.1109\/CVPR.2016.572"},{"key":"e_1_3_2_2_2_1","volume":"2017","author":"Buch S.","unstructured":"S. Buch , V. Escorcia , B. Ghanem , L. Fei-Fei , and J. Niebles. 2017 a. End-to-End, Single-Stream Temporal Action Detection in Untrimmed Videos. In BMVC. S. Buch, V. Escorcia, B. Ghanem, L. Fei-Fei, and J. Niebles. 2017a. End-to-End, Single-Stream Temporal Action Detection in Untrimmed Videos. In BMVC.","journal-title":"J. Niebles."},{"key":"e_1_3_2_2_3_1","volume-title":"Single-Stream Temporal Action Detection in Untrimmed Videos. In BMVC","author":"Buch Shyamal","year":"2017","unstructured":"Shyamal Buch , Victor Escorcia , Bernard Ghanem , Li Fei-Fei , and Juan Carlos Niebles . 2017 b. End-to-End , Single-Stream Temporal Action Detection in Untrimmed Videos. In BMVC 2017. Shyamal Buch, Victor Escorcia, Bernard Ghanem, Li Fei-Fei, and Juan Carlos Niebles. 2017b. End-to-End, Single-Stream Temporal Action Detection in Untrimmed Videos. In BMVC 2017."},{"key":"e_1_3_2_2_4_1","doi-asserted-by":"crossref","unstructured":"J. Carreira and A. Zisserman. 2017. Quo vadis action recognition? a new model and the kinetics dataset. In CVPR.  J. Carreira and A. Zisserman. 2017. Quo vadis action recognition? a new model and the kinetics dataset. 
In CVPR.","DOI":"10.1109\/CVPR.2017.502"},{"key":"e_1_3_2_2_5_1","doi-asserted-by":"crossref","unstructured":"Y. Chao S. Vijayanarasimhan B. Seybold D. Ross J. Deng and R. Sukthankar. 2018. Rethinking the Faster R-CNN Architecture for Temporal Action Localization. In CVPR.  Y. Chao S. Vijayanarasimhan B. Seybold D. Ross J. Deng and R. Sukthankar. 2018. Rethinking the Faster R-CNN Architecture for Temporal Action Localization. In CVPR.","DOI":"10.1109\/CVPR.2018.00124"},{"key":"e_1_3_2_2_6_1","unstructured":"Xiyang Dai Bharat Singh Guyue Zhang Larry S Davis and Yan Qiu Chen. 2017. Temporal Context Network for Activity Localization in Videos. In ICCV.  Xiyang Dai Bharat Singh Guyue Zhang Larry S Davis and Yan Qiu Chen. 2017. Temporal Context Network for Activity Localization in Videos. In ICCV."},{"key":"e_1_3_2_2_7_1","doi-asserted-by":"crossref","unstructured":"Navneet Dalal Bill Triggs and Cordelia Schmid. 2006. Human detection using oriented histograms of flow and appearance. In ECCV.  Navneet Dalal Bill Triggs and Cordelia Schmid. 2006. Human detection using oriented histograms of flow and appearance. In ECCV.","DOI":"10.1007\/11744047_33"},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"crossref","unstructured":"Ali Diba Vivek Sharma and Luc Van Gool. 2017. Deep Temporal Linear Encoding Networks. In CVPR.  Ali Diba Vivek Sharma and Luc Van Gool. 2017. Deep Temporal Linear Encoding Networks. In CVPR.","DOI":"10.1109\/CVPR.2017.168"},{"key":"e_1_3_2_2_9_1","volume-title":"Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell.","author":"Donahue Jeffrey","year":"2015","unstructured":"Jeffrey Donahue , Lisa Anne Hendricks , Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015 . Long-term recurrent convolutional networks for visual recognition and description. In CVPR. 
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In CVPR."},{"key":"e_1_3_2_2_10_1","doi-asserted-by":"crossref","unstructured":"V. Escorcia F. Heilbron J. Niebles and B. Ghanem. 2016. DAPs: Deep Action Proposals for Action Understanding. In ECCV.  V. Escorcia F. Heilbron J. Niebles and B. Ghanem. 2016. DAPs: Deep Action Proposals for Action Understanding. In ECCV.","DOI":"10.1007\/978-3-319-46487-9_47"},{"key":"e_1_3_2_2_11_1","doi-asserted-by":"crossref","unstructured":"Lijie Fan Wenbing Huang Chuang Gan Stefano Ermon Boqing Gong and Junzhou Huang. 2018. End-to-end learning of motion representation for video understanding. In CVPR. 6016--6025.  Lijie Fan Wenbing Huang Chuang Gan Stefano Ermon Boqing Gong and Junzhou Huang. 2018. End-to-end learning of motion representation for video understanding. In CVPR. 6016--6025.","DOI":"10.1109\/CVPR.2018.00630"},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"crossref","unstructured":"Christoph Feichtenhofer Axel Pinz and Richard Wildes. 2016. Spatiotemporal residual networks for video action recognition. In NIPS.  Christoph Feichtenhofer Axel Pinz and Richard Wildes. 2016. Spatiotemporal residual networks for video action recognition. In NIPS.","DOI":"10.1109\/CVPR.2017.787"},{"key":"e_1_3_2_2_13_1","doi-asserted-by":"crossref","unstructured":"Chuang Gan Chen Sun Lixin Duan and Boqing Gong. 2016a. Webly-supervised video recognition by mutually voting for relevant web images and web video frames. In ECCV. 849--866.  Chuang Gan Chen Sun Lixin Duan and Boqing Gong. 2016a. Webly-supervised video recognition by mutually voting for relevant web images and web video frames. In ECCV. 849--866.","DOI":"10.1007\/978-3-319-46487-9_52"},{"key":"e_1_3_2_2_14_1","volume-title":"Devnet: A deep event network for multimedia event detection and evidence recounting. In CVPR. 
2568--2577.","author":"Gan Chuang","year":"2015","unstructured":"Chuang Gan , Naiyan Wang , Yi Yang , Dit-Yan Yeung , and Alex G Hauptmann . 2015 . Devnet: A deep event network for multimedia event detection and evidence recounting. In CVPR. 2568--2577. Chuang Gan, Naiyan Wang, Yi Yang, Dit-Yan Yeung, and Alex G Hauptmann. 2015. Devnet: A deep event network for multimedia event detection and evidence recounting. In CVPR. 2568--2577."},{"key":"e_1_3_2_2_15_1","unstructured":"Chuang Gan Ting Yao Kuiyuan Yang Yi Yang and Tao Mei. 2016b. You lead we exceed: Labor-free video concept learning by jointly exploiting web videos and images. In CVPR. 923--932.  Chuang Gan Ting Yao Kuiyuan Yang Yi Yang and Tao Mei. 2016b. You lead we exceed: Labor-free video concept learning by jointly exploiting web videos and images. In CVPR. 923--932."},{"key":"e_1_3_2_2_16_1","unstructured":"Jiyang Gao Zhenheng Yang and Ram Nevatia. 2017a. Cascaded Boundary Regression for Temporal Action Detection. In BMVC.  Jiyang Gao Zhenheng Yang and Ram Nevatia. 2017a. Cascaded Boundary Regression for Temporal Action Detection. In BMVC."},{"key":"e_1_3_2_2_17_1","unstructured":"Jiyang Gao Zhenheng Yang Chen Sun Kan Chen and Ram Nevatia. 2017b. Turn tap: Temporal unit regression network for temporal action proposals. In ICCV.  Jiyang Gao Zhenheng Yang Chen Sun Kan Chen and Ram Nevatia. 2017b. Turn tap: Temporal unit regression network for temporal action proposals. In ICCV."},{"key":"e_1_3_2_2_18_1","doi-asserted-by":"crossref","unstructured":"Rohit Girdhar Deva Ramanan Abhinav Gupta Josef Sivic and Bryan Russell. 2017. ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification. In CVPR.  Rohit Girdhar Deva Ramanan Abhinav Gupta Josef Sivic and Bryan Russell. 2017. ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification. In CVPR.","DOI":"10.1109\/CVPR.2017.337"},{"key":"e_1_3_2_2_19_1","volume":"201","author":"He K.","unstructured":"K. He , X. Zhang , S. Ren , and J. 
Sun. 201 6. Deep Residual Learning for Image Recognition. In CVPR. K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep Residual Learning for Image Recognition. In CVPR.","journal-title":"J. Sun."},{"key":"e_1_3_2_2_20_1","volume":"201","author":"Heilbron F.","unstructured":"F. Heilbron , V. Escorcia , B. Ghanem , and J. Niebles. 201 2. Activitynet: A large-scale video benchmark for human activity understanding. In CVPR. F. Heilbron, V. Escorcia, B. Ghanem, and J. Niebles. 2012. Activitynet: A large-scale video benchmark for human activity understanding. In CVPR.","journal-title":"J. Niebles."},{"key":"e_1_3_2_2_21_1","doi-asserted-by":"crossref","unstructured":"F. Heilbron J. Niebles and B. Ghanem. 2016. Fast Temporal Activity Proposals for Efficient Detection of Human Actions in Untrimmed Videos. In CVPR.  F. Heilbron J. Niebles and B. Ghanem. 2016. Fast Temporal Activity Proposals for Efficient Detection of Human Actions in Untrimmed Videos. In CVPR.","DOI":"10.1109\/CVPR.2016.211"},{"key":"e_1_3_2_2_22_1","volume-title":"SCC: Semantic context cascade for efficient action detection. In CVPR.","author":"Heilbron F Caba","year":"2017","unstructured":"F Caba Heilbron , Wayner Barrios , Victor Escorcia , and Bernard Ghanem . 2017 . SCC: Semantic context cascade for efficient action detection. In CVPR. F Caba Heilbron, Wayner Barrios, Victor Escorcia, and Bernard Ghanem. 2017. SCC: Semantic context cascade for efficient action detection. In CVPR."},{"key":"e_1_3_2_2_23_1","doi-asserted-by":"crossref","unstructured":"R. Hou R. Sukthankar and M. Shah. 2017. Real-time temporal action localization in untrimmed videos by sub-action discovery. In BMVC.  R. Hou R. Sukthankar and M. Shah. 2017. Real-time temporal action localization in untrimmed videos by sub-action discovery. In BMVC.","DOI":"10.5244\/C.31.91"},{"key":"e_1_3_2_2_24_1","doi-asserted-by":"crossref","unstructured":"J. Hu L. Shen S. Albanie G. Sun and E. Wu. 2018. Squeeze-and-Excitation Networks. In CVPR.  J. Hu L. 
Shen S. Albanie G. Sun and E. Wu. 2018. Squeeze-and-Excitation Networks. In CVPR.","DOI":"10.1109\/CVPR.2018.00745"},{"key":"e_1_3_2_2_25_1","unstructured":"Y. Jiang J. Liu A. Zamir G. Toderici I. Laptev M. Shah and R. Sukthankar. 2014. THUMOS challenge: Action recognition with a large number of classes. In http:\/\/crcv.ucf.edu\/THUMOS14\/.  Y. Jiang J. Liu A. Zamir G. Toderici I. Laptev M. Shah and R. Sukthankar. 2014. THUMOS challenge: Action recognition with a large number of classes. In http:\/\/crcv.ucf.edu\/THUMOS14\/."},{"key":"e_1_3_2_2_26_1","unstructured":"S. Karaman L. Seidenari and A. Bimbo. 2014. Fast saliency based pooling of fisher encoded dense trajectories. In http:\/\/crcv.ucf.edu\/THUMOS14\/.  S. Karaman L. Seidenari and A. Bimbo. 2014. Fast saliency based pooling of fisher encoded dense trajectories. In http:\/\/crcv.ucf.edu\/THUMOS14\/."},{"key":"e_1_3_2_2_27_1","doi-asserted-by":"crossref","unstructured":"A. Karpathy G. Toderici S. Shetty T. Leung R. Sukthankar and L. Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. In CVPR.  A. Karpathy G. Toderici S. Shetty T. Leung R. Sukthankar and L. Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. In CVPR.","DOI":"10.1109\/CVPR.2014.223"},{"key":"e_1_3_2_2_28_1","doi-asserted-by":"crossref","unstructured":"Alexander Klaser Marcin Marsza\u0142ek and Cordelia Schmid. 2008. A spatio-temporal descriptor based on 3d-gradients. In BMVC.  Alexander Klaser Marcin Marsza\u0142ek and Cordelia Schmid. 2008. A spatio-temporal descriptor based on 3d-gradients. In BMVC.","DOI":"10.5244\/C.22.99"},{"key":"e_1_3_2_2_29_1","doi-asserted-by":"crossref","unstructured":"Hei Law and Jia Deng. 2018. CornerNet: Detecting Objects as Paired Keypoints. In ECCV.  Hei Law and Jia Deng. 2018. CornerNet: Detecting Objects as Paired Keypoints. 
In ECCV.","DOI":"10.1007\/978-3-030-01264-9_45"},{"key":"e_1_3_2_2_30_1","volume-title":"Austin Reiter, and Gregory D Hager","author":"Lea Colin","year":"2017","unstructured":"Colin Lea , M. Flynn , Rene Vidal , Austin Reiter, and Gregory D Hager . 2017 . Temporal convolutional networks for action segmentation and detection. In CVPR. Colin Lea, M. Flynn, Rene Vidal, Austin Reiter, and Gregory D Hager. 2017. Temporal convolutional networks for action segmentation and detection. In CVPR."},{"key":"e_1_3_2_2_31_1","volume-title":"Austin Reiter, and Gregory D Hager","author":"Lea Colin","year":"2016","unstructured":"Colin Lea , Rene Vidal , Austin Reiter, and Gregory D Hager . 2016 . Temporal convolutional networks: A unified approach to action segmentation. In ECCV. Colin Lea, Rene Vidal, Austin Reiter, and Gregory D Hager. 2016. Temporal convolutional networks: A unified approach to action segmentation. In ECCV."},{"key":"e_1_3_2_2_32_1","volume-title":"TSM: Temporal shift module for efficient video understanding. In CVPR. 7083--7093.","author":"Lin Ji","year":"2019","unstructured":"Ji Lin , Chuang Gan , and Song Han . 2019 . TSM: Temporal shift module for efficient video understanding. In CVPR. 7083--7093. Ji Lin, Chuang Gan, and Song Han. 2019. TSM: Temporal shift module for efficient video understanding. In CVPR. 7083--7093."},{"key":"e_1_3_2_2_33_1","volume-title":"R. Girshick, K. He, B. Hariharan, and S. Belongie.","author":"Lin T.","year":"2018","unstructured":"T. Lin , P. Doll\u00e1r , R. Girshick, K. He, B. Hariharan, and S. Belongie. 2018 a. Feature Pyramid Networks for Object Detection. In ECCV. T. Lin, P. Doll\u00e1r, R. Girshick, K. He, B. Hariharan, and S. Belongie. 2018a. Feature Pyramid Networks for Object Detection. In ECCV."},{"key":"e_1_3_2_2_34_1","unstructured":"Tianwei Lin Xu Zhao and Zheng Shou. 2017a. Single shot temporal action detection. In ACM Multimedia.  Tianwei Lin Xu Zhao and Zheng Shou. 2017a. Single shot temporal action detection. 
In ACM Multimedia."},{"key":"e_1_3_2_2_35_1","unstructured":"T. Lin X. Zhao and Z. Shou. 2017b. Temporal Convolution Based Action Proposal: Submission to ActivityNet 2017. arXiv preprint arXiv:1707.06750 (2017).  T. Lin X. Zhao and Z. Shou. 2017b. Temporal Convolution Based Action Proposal: Submission to ActivityNet 2017. arXiv preprint arXiv:1707.06750 (2017)."},{"key":"e_1_3_2_2_36_1","volume-title":"BSN: Boundary Sensitive Network for Temporal Action Proposal Generation. In ECCV.","author":"Lin T.","year":"2018","unstructured":"T. Lin , X. Zhao , H. Su , C. Wang , and M. Yang . 2018 b. BSN: Boundary Sensitive Network for Temporal Action Proposal Generation. In ECCV. T. Lin, X. Zhao, H. Su, C. Wang, and M. Yang. 2018b. BSN: Boundary Sensitive Network for Temporal Action Proposal Generation. In ECCV."},{"key":"e_1_3_2_2_37_1","volume-title":"SSD: Single Shot MultiBox Detector. In ECCV.","author":"Liu W.","year":"2016","unstructured":"W. Liu , D. Anguelov , D. Erhan , C. Szegedy , S. Reed , C. Fu , and A. Berg . 2016 . SSD: Single Shot MultiBox Detector. In ECCV. W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. Berg. 2016. SSD: Single Shot MultiBox Detector. In ECCV."},{"key":"e_1_3_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00043"},{"key":"e_1_3_2_2_39_1","volume-title":"Xiao Liu, Yandong Li, Fu Li, and Shilei Wen.","author":"Long Xiang","year":"2018","unstructured":"Xiang Long , Chuang Gan , Gerard De Melo , Xiao Liu, Yandong Li, Fu Li, and Shilei Wen. 2018 a. Multimodal keyless attention fusion for video classification. In AAAI. Xiang Long, Chuang Gan, Gerard De Melo, Xiao Liu, Yandong Li, Fu Li, and Shilei Wen. 2018a. Multimodal keyless attention fusion for video classification. In AAAI."},{"key":"e_1_3_2_2_40_1","volume-title":"Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification. 
In CVPR.","author":"Long Xiang","year":"2018","unstructured":"Xiang Long , Chuang Gan , Gerard de Melo , Jiajun Wu , Xiao Liu , and Shilei Wen . 2018 b. Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification. In CVPR. Xiang Long, Chuang Gan, Gerard de Melo, Jiajun Wu, Xiao Liu, and Shilei Wen. 2018b. Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification. In CVPR."},{"key":"e_1_3_2_2_41_1","doi-asserted-by":"crossref","unstructured":"S. Ma L. Sigal and S. Sclaroff. 2016. Learning activity progression in LSTMs for activity detection and early detection. In CVPR.  S. Ma L. Sigal and S. Sclaroff. 2016. Learning activity progression in LSTMs for activity detection and early detection. In CVPR.","DOI":"10.1109\/CVPR.2016.214"},{"key":"e_1_3_2_2_42_1","unstructured":"D. Oneata J. Verbeek and C. Schmid. 2014. The LEAR submission at thumos 2014. In http:\/\/crcv.ucf.edu\/THUMOS14\/.  D. Oneata J. Verbeek and C. Schmid. 2014. The LEAR submission at thumos 2014. In http:\/\/crcv.ucf.edu\/THUMOS14\/."},{"key":"e_1_3_2_2_43_1","doi-asserted-by":"crossref","unstructured":"Z. Qiu T. Yao and T. Mei. 2017. Learning spatio-temporal representation with pseudo-3d residual networks. In ICCV.  Z. Qiu T. Yao and T. Mei. 2017. Learning spatio-temporal representation with pseudo-3d residual networks. In ICCV.","DOI":"10.1109\/ICCV.2017.590"},{"key":"e_1_3_2_2_44_1","volume":"201","author":"Ren S.","unstructured":"S. Ren , K. He , R. Girshick , and J. Sun. 201 5. Faster R-CNN. In NIPS. S. Ren, K. He, R. Girshick, and J. Sun. 2015. Faster R-CNN. In NIPS.","journal-title":"J. Sun."},{"key":"e_1_3_2_2_45_1","volume":"201","author":"Richard A.","unstructured":"A. Richard and J. Gall. 201 6. Temporal action detection using a statistical language model. In CVPR. A. Richard and J. Gall. 2016. Temporal action detection using a statistical language model. In CVPR.","journal-title":"J. 
Gall."},{"key":"e_1_3_2_2_46_1","doi-asserted-by":"crossref","unstructured":"R. Girshick. 2015. Fast R-CNN. In ICCV.  R. Girshick. 2015. Fast R-CNN. In ICCV.","DOI":"10.1109\/ICCV.2015.169"},{"key":"e_1_3_2_2_47_1","volume":"201","author":"Girshick R.","unstructured":"R. Girshick , J. Donahue , T. Darrell , and J. Malik. 201 4. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR. R. Girshick, J. Donahue, T. Darrell, and J. Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR.","journal-title":"J. Malik."},{"key":"e_1_3_2_2_48_1","doi-asserted-by":"crossref","unstructured":"Paul Scovanner Saad Ali and Mubarak Shah. 2007. A 3-dimensional sift descriptor and its application to action recognition. In ACM Multimedia.  Paul Scovanner Saad Ali and Mubarak Shah. 2007. A 3-dimensional sift descriptor and its application to action recognition. In ACM Multimedia.","DOI":"10.1145\/1291233.1291311"},{"key":"e_1_3_2_2_49_1","doi-asserted-by":"crossref","unstructured":"Yemin Shi Yonghong Tian Yaowei Wang Wei Zeng and Tiejun Huang. 2017. Learning Long-Term Dependencies for Action Recognition With a Biologically-Inspired Deep Network. In ICCV.  Yemin Shi Yonghong Tian Yaowei Wang Wei Zeng and Tiejun Huang. 2017. Learning Long-Term Dependencies for Action Recognition With a Biologically-Inspired Deep Network. In ICCV.","DOI":"10.1109\/ICCV.2017.84"},{"key":"e_1_3_2_2_50_1","doi-asserted-by":"crossref","unstructured":"Zheng Shou Jonathan Chan Alireza Zareian Kazuyuki Miyazawa and Shih-Fu Chang. 2017. CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In CVPR.  Zheng Shou Jonathan Chan Alireza Zareian Kazuyuki Miyazawa and Shih-Fu Chang. 2017. CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. 
In CVPR.","DOI":"10.1109\/CVPR.2017.155"},{"key":"e_1_3_2_2_51_1","doi-asserted-by":"crossref","unstructured":"Zheng Shou Dongang Wang and Shih-Fu Chang. 2016. Temporal action localization in untrimmed videos via multi-stage cnns. In CVPR.  Zheng Shou Dongang Wang and Shih-Fu Chang. 2016. Temporal action localization in untrimmed videos via multi-stage cnns. In CVPR.","DOI":"10.1109\/CVPR.2016.119"},{"key":"e_1_3_2_2_52_1","unstructured":"K. Simonyan and A. Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In NIPS.  K. Simonyan and A. Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In NIPS."},{"key":"e_1_3_2_2_53_1","unstructured":"G. Singh and F. Cuzzolin. 2016. Untrimmed video classification for activity detection: submission to ActivityNet challenge. In ActivityNet Large Scale Activity Recognition Challenge.  G. Singh and F. Cuzzolin. 2016. Untrimmed video classification for activity detection: submission to ActivityNet challenge. In ActivityNet Large Scale Activity Recognition Challenge."},{"key":"e_1_3_2_2_54_1","doi-asserted-by":"crossref","unstructured":"K. Tang B. Yao L. Fei-Fei and D. Koller. 2013. Combining the right features for complex event recognition. In CVPR.  K. Tang B. Yao L. Fei-Fei and D. Koller. 2013. Combining the right features for complex event recognition. In CVPR.","DOI":"10.1109\/ICCV.2013.335"},{"key":"e_1_3_2_2_55_1","doi-asserted-by":"crossref","unstructured":"D. Tran L. Bourdev R. Fergus L. Torresani and M. Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In ICCV.  D. Tran L. Bourdev R. Fergus L. Torresani and M. Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In ICCV.","DOI":"10.1109\/ICCV.2015.510"},{"key":"e_1_3_2_2_56_1","doi-asserted-by":"crossref","unstructured":"Heng Wang and Cordelia Schmid. 2013. Action recognition with improved trajectories. In ICCV.  Heng Wang and Cordelia Schmid. 2013. 
Action recognition with improved trajectories. In ICCV.","DOI":"10.1109\/ICCV.2013.441"},{"key":"e_1_3_2_2_57_1","doi-asserted-by":"crossref","unstructured":"Limin Wang Wei Li Wen Li and Luc Van Gool. 2018b. Appearance-and-Relation Networks for Video Classification. In CVPR.  Limin Wang Wei Li Wen Li and Luc Van Gool. 2018b. Appearance-and-Relation Networks for Video Classification. In CVPR.","DOI":"10.1109\/CVPR.2018.00155"},{"key":"e_1_3_2_2_58_1","unstructured":"L. Wang Y. Qiao and X. Tang. 2014. Action recognition and detection by combining motion and appearance features. In http:\/\/crcv.ucf.edu\/THUMOS14\/.  L. Wang Y. Qiao and X. Tang. 2014. Action recognition and detection by combining motion and appearance features. In http:\/\/crcv.ucf.edu\/THUMOS14\/."},{"key":"e_1_3_2_2_59_1","doi-asserted-by":"crossref","unstructured":"Limin Wang Yuanjun Xiong Dahua Lin and Luc Van Gool. 2017. Untrimmednets for weakly supervised action recognition and detection. In CVPR.  Limin Wang Yuanjun Xiong Dahua Lin and Luc Van Gool. 2017. Untrimmednets for weakly supervised action recognition and detection. In CVPR.","DOI":"10.1109\/CVPR.2017.678"},{"key":"e_1_3_2_2_60_1","doi-asserted-by":"crossref","unstructured":"L. Wang Y. Xiong Z. Wang Y. Qiao D. Lin and X. Tang. 2016. Temporal segment networks: Towards good practices for deep action recognition. In ECCV.  L. Wang Y. Xiong Z. Wang Y. Qiao D. Lin and X. Tang. 2016. Temporal segment networks: Towards good practices for deep action recognition. In ECCV.","DOI":"10.1007\/978-3-319-46484-8_2"},{"key":"e_1_3_2_2_61_1","unstructured":"R. Wang and D. Tao. 2016. UTS at ActivityNet 2016. In ActivityNet Large Scale Activity Recognition Challenge.  R. Wang and D. Tao. 2016. UTS at ActivityNet 2016. In ActivityNet Large Scale Activity Recognition Challenge."},{"key":"e_1_3_2_2_62_1","doi-asserted-by":"crossref","unstructured":"Xiaolong Wang Ross Girshick Abhinav Gupta and Kaiming He. 2018a. Non-local Neural Networks. In CVPR.  
Xiaolong Wang Ross Girshick Abhinav Gupta and Kaiming He. 2018a. Non-local Neural Networks. In CVPR.","DOI":"10.1109\/CVPR.2018.00813"},{"key":"e_1_3_2_2_63_1","unstructured":"Saining Xie Chen Sun Jonathan Huang Zhuowen Tu and Kevin Murphy. 2018. Rethinking Spatiotemporal Feature Learning For Video Understanding. In ECCV.  Saining Xie Chen Sun Jonathan Huang Zhuowen Tu and Kevin Murphy. 2018. Rethinking Spatiotemporal Feature Learning For Video Understanding. In ECCV."},{"key":"e_1_3_2_2_64_1","unstructured":"Y. Xiong Y. Zhao L. Wang D. Lin and X. Tang. 2017. A Pursuit of Temporal Accuracy in General Activity Detection. In CVPR.  Y. Xiong Y. Zhao L. Wang D. Lin and X. Tang. 2017. A Pursuit of Temporal Accuracy in General Activity Detection. In CVPR."},{"key":"e_1_3_2_2_65_1","unstructured":"Huijuan Xu Abir Das and Kate Saenko. 2017. R-c3d: Region convolutional 3d network for temporal activity detection. In ICCV.  Huijuan Xu Abir Das and Kate Saenko. 2017. R-c3d: Region convolutional 3d network for temporal activity detection. In ICCV."},{"key":"e_1_3_2_2_66_1","doi-asserted-by":"crossref","unstructured":"S. Yeung O. Russakovsky G. Mori and L. Fei-Fei. 2016a. End-to-end learning of action detection from frame glimpses in videos. In CVPR.  S. Yeung O. Russakovsky G. Mori and L. Fei-Fei. 2016a. End-to-end learning of action detection from frame glimpses in videos. In CVPR.","DOI":"10.1109\/CVPR.2016.293"},{"key":"e_1_3_2_2_67_1","doi-asserted-by":"crossref","unstructured":"S. Yeung O. Russakovsky G. Mori and L. Fei-Fei. 2016b. End-to-end learning of action detection from frame glimpses in videos. In CVPR.  S. Yeung O. Russakovsky G. Mori and L. Fei-Fei. 2016b. End-to-end learning of action detection from frame glimpses in videos. In CVPR.","DOI":"10.1109\/CVPR.2016.293"},{"key":"e_1_3_2_2_68_1","doi-asserted-by":"crossref","unstructured":"J. Yuan B. Ni X. Yang and A. Kssim. 2016. Temporal action localization with pyramid of score distribution features. In CVPR.  J. 
Yuan B. Ni X. Yang and A. Kssim. 2016. Temporal action localization with pyramid of score distribution features. In CVPR.","DOI":"10.1109\/CVPR.2016.337"},{"key":"e_1_3_2_2_69_1","doi-asserted-by":"crossref","unstructured":"Zehuan Yuan Jonathan C Stroud Tong Lu and Jia Deng. 2017. Temporal Action Localization by Structured Maximal Sums. In CVPR.  Zehuan Yuan Jonathan C Stroud Tong Lu and Jia Deng. 2017. Temporal Action Localization by Structured Maximal Sums. In CVPR.","DOI":"10.1109\/CVPR.2017.342"},{"volume-title":"Proceedings of the 29th DAGM Conference on Pattern Recognition.","author":"Zach C.","key":"e_1_3_2_2_70_1","unstructured":"C. Zach , T. Pock , and H. Bischof . 2007. A Duality Based Approach for Realtime TV-L1 Optical Flow . In Proceedings of the 29th DAGM Conference on Pattern Recognition. C. Zach, T. Pock, and H. Bischof. 2007. A Duality Based Approach for Realtime TV-L1 Optical Flow. In Proceedings of the 29th DAGM Conference on Pattern Recognition."},{"key":"e_1_3_2_2_71_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00719"},{"key":"e_1_3_2_2_72_1","doi-asserted-by":"crossref","unstructured":"Yue Zhao Yuanjun Xiong Limin Wang Zhirong Wu Xiaoou Tang and Dahua Lin. 2017a. Temporal action detection with structured segment networks. In ICCV.  Yue Zhao Yuanjun Xiong Limin Wang Zhirong Wu Xiaoou Tang and Dahua Lin. 2017a. Temporal action detection with structured segment networks. In ICCV.","DOI":"10.1109\/ICCV.2017.317"},{"key":"e_1_3_2_2_73_1","unstructured":"Y. Zhao B. Zhang Z. Wu S. Yang L. Zhou S. Yan L. Wang Y. Xiong D. Lin Y. Qiao and X. Tang. 2017b. Cuhk & ethz & siat submission to activitynet challenge 2017. arXiv preprint arXiv:1710.08011 (2017).  Y. Zhao B. Zhang Z. Wu S. Yang L. Zhou S. Yan L. Wang Y. Xiong D. Lin Y. Qiao and X. Tang. 2017b. Cuhk & ethz & siat submission to activitynet challenge 2017. 
arXiv preprint arXiv:1710.08011 (2017)."}],"event":{"name":"MM '20: The 28th ACM International Conference on Multimedia","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Seattle WA USA","acronym":"MM '20"},"container-title":["Proceedings of the 28th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3394171.3413860","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3394171.3413860","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T22:01:18Z","timestamp":1750197678000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3394171.3413860"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,10,12]]},"references-count":73,"alternative-id":["10.1145\/3394171.3413860","10.1145\/3394171"],"URL":"https:\/\/doi.org\/10.1145\/3394171.3413860","relation":{},"subject":[],"published":{"date-parts":[[2020,10,12]]},"assertion":[{"value":"2020-10-12","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}