{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,27]],"date-time":"2025-10-27T21:06:10Z","timestamp":1761599170547,"version":"build-2065373602"},"reference-count":57,"publisher":"MDPI AG","issue":"7","license":[{"start":{"date-parts":[[2018,6,21]],"date-time":"2018-06-21T00:00:00Z","timestamp":1529539200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61629301, 61773312, and 61503296"],"award-info":[{"award-number":["61629301, 61773312, and 61503296"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100012166","name":"National Key Research and Development Program of China","doi-asserted-by":"publisher","award":["2016YFB1000903"],"award-info":[{"award-number":["2016YFB1000903"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Research in human action recognition has accelerated significantly since the introduction of powerful machine learning tools such as Convolutional Neural Networks (CNNs). However, effective and efficient methods for incorporation of temporal information into CNNs are still being actively explored in the recent literature. Motivated by the popular recurrent attention models in the research area of natural language processing, we propose the Attention-aware Temporal Weighted CNN (ATW CNN) for action recognition in videos, which embeds a visual attention model into a temporal weighted multi-stream CNN. This attention model is simply implemented as temporal weighting yet it effectively boosts the recognition performance of video representations. 
Besides, each stream in the proposed ATW CNN framework is capable of end-to-end training, with both network parameters and temporal weights optimized by stochastic gradient descent (SGD) with back-propagation. Our experimental results on the UCF-101 and HMDB-51 datasets showed that the proposed attention mechanism contributes substantially to the performance gains with the more discriminative snippets by focusing on more relevant video segments.<\/jats:p>","DOI":"10.3390\/s18071979","type":"journal-article","created":{"date-parts":[[2018,6,22]],"date-time":"2018-06-22T02:46:21Z","timestamp":1529635581000},"page":"1979","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":33,"title":["Action Recognition by an Attention-Aware Temporal Weighted Convolutional Neural Network"],"prefix":"10.3390","volume":"18","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6636-6396","authenticated-orcid":false,"given":"Le","family":"Wang","sequence":"first","affiliation":[{"name":"Institute of Artificial Intelligence and Robotics, Xi\u2019an Jiaotong University, Xi\u2019an 710049, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jinliang","family":"Zang","sequence":"additional","affiliation":[{"name":"Institute of Artificial Intelligence and Robotics, Xi\u2019an Jiaotong University, Xi\u2019an 710049, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7917-9749","authenticated-orcid":false,"given":"Qilin","family":"Zhang","sequence":"additional","affiliation":[{"name":"HERE Technologies, Chicago, IL 60606, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zhenxing","family":"Niu","sequence":"additional","affiliation":[{"name":"Alibaba Group, Hangzhou 311121, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Gang","family":"Hua","sequence":"additional","affiliation":[{"name":"Microsoft Research, Redmond, WA 98052, 
USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Nanning","family":"Zheng","sequence":"additional","affiliation":[{"name":"Institute of Artificial Intelligence and Robotics, Xi\u2019an Jiaotong University, Xi\u2019an 710049, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2018,6,21]]},"reference":[{"key":"ref_1","unstructured":"Simonyan, K., and Zisserman, A. (2015, January 7\u20139). Very deep convolutional networks for large-scale image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA."},{"key":"ref_2","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (July, January 26). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA."},{"key":"ref_3","unstructured":"Wang, L., Xue, J., Zheng, N., and Hua, G. (2011, January 6\u201313). Automatic salient object extraction with contextual cue. Proceedings of the IEEE International Conference on Computer Vision, Barcelona, Spain."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"2074","DOI":"10.1109\/TPAMI.2016.2612187","article-title":"Video object discovery and co-segmentation with extremely weak supervision","volume":"39","author":"Wang","year":"2017","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Long, J., Shelhamer, E., and Darrell, T. (2015, January 7\u201312). Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298965"},{"key":"ref_6","unstructured":"Simonyan, K., and Zisserman, A. (2014, January 8\u201313). Two-stream convolutional networks for action recognition in videos. 
Proceedings of the Advances in Neural Information Processing Systems, Montr\u00e9al, QC, Canada."},{"key":"ref_7","unstructured":"Ng, J.Y.H., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., and Toderici, G. (2015, January 7\u201312). Beyond short snippets: Deep networks for video classification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., and Van Gool, L. (2016, January 8\u201316). Temporal segment networks: Towards good practices for deep action recognition. Proceedings of the IEEE Conference on European Conference on Computer Vision, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46484-8_2"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Carreira, J., and Zisserman, A. (2017, January 22\u201325). Quo vadis, action recognition? a new model and the kinetics dataset. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.502"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., and Darrell, T. (2015, January 7\u201312). Long-term recurrent convolutional networks for visual recognition and description. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298878"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. (2015, January 13\u201316). Learning spatiotemporal features with 3d convolutional networks. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.510"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Ch\u00e9ron, G., Laptev, I., and Schmid, C. (2015, January 13\u201316). 
P-cnn: Pose-based cnn features for action recognition. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.368"},{"key":"ref_13","unstructured":"Feichtenhofer, C., Pinz, A., and Zisserman, A. (July, January 26). Convolutional two-stream network fusion for video action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Huang, J., Zhou, W., Zhang, Q., Li, H., and Li, W. (2018, January 2\u20137). Video-based sign language recognition without temporal segmentation. Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA.","DOI":"10.1609\/aaai.v32i1.11903"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Wang, L., Duan, X., Zhang, Q., Niu, Z., Hua, G., and Zheng, N. (2018). Segment-tube: Spatio-temporal action localization in untrimmed videos with per-frame segmentation. Sensors, 18.","DOI":"10.3390\/s18051657"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Duan, X., Wang, L., Zhai, C., Zhang, Q., Niu, Z., Zheng, N., and Hua, G. (2018, January 7\u201310). Joint spatio-temporal action localization in untrimmed videos with per-frame segmentation. Proceedings of the IEEE International Conference on Image Processing, Athens, Greece.","DOI":"10.1109\/ICIP.2018.8451692"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Gao, Z., Hua, G., Zhang, D., Jojic, N., Wang, L., Xue, J., and Zheng, N. (2017, January 21\u201326). ER3: A unified framework for event retrieval, recognition and recounting. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.227"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"221","DOI":"10.1109\/TPAMI.2012.59","article-title":"3D convolutional neural networks for human action recognition","volume":"35","author":"Ji","year":"2013","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Wang, H., and Schmid, C. (2013, January 3\u20136). Action recognition with improved trajectories. Proceedings of the IEEE International Conference on Computer Vision, Sydney, NSW, Australia.","DOI":"10.1109\/ICCV.2013.441"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Schuldt, C., Laptev, I., and Caputo, B. (2004, January 23\u201326). Recognizing human actions: a local SVM approach. Proceedings of the IEEE International Conference on Pattern Recognition, Cambridge, UK.","DOI":"10.1109\/ICPR.2004.1334462"},{"key":"ref_21","unstructured":"Soomro, K., Zamir, A.R., and Shah, M. (arXiv, 2012). UCF101: A dataset of 101 human actions classes from videos in the wild, arXiv."},{"key":"ref_22","unstructured":"Nagel, W., Kr\u00f6ner, D., and Resch, M. (2013). HMDB51: A large video database for human motion recognition. High Performance Computing in Science and Engineering, Springer."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Luong, M.T., Pham, H., and Manning, C.D. (2015, January 17\u201321). Effective approaches to attention-based neural machine translation. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal.","DOI":"10.18653\/v1\/D15-1166"},{"key":"ref_24","unstructured":"Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. (2015, January 6\u201311). Show, attend and tell: Neural image caption generation with visual attention. 
Proceedings of the International Conference on Machine Learning, Lille, France."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Zang, J., Wang, L., Liu, Z., Zhang, Q., Niu, Z., Hua, G., and Zheng, N. (2018, January 25\u201327). Attention-based temporal weighted convolutional neural network for action recognition. Proceedings of the International Conference on Artificial Intelligence Applications and Innovations, Rhodes, Greece.","DOI":"10.1007\/978-3-319-92007-8_9"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"107","DOI":"10.1007\/s11263-005-1838-7","article-title":"On space-time interest points","volume":"64","author":"Laptev","year":"2005","journal-title":"Int. J. Comput. Vis."},{"key":"ref_27","unstructured":"Wang, H., Kl\u00e4ser, A., Schmid, C., and Liu, C.L. (2011, January 20\u201325). Action recognition by dense trajectories. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Colorado, CO, USA."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"109","DOI":"10.1016\/j.cviu.2016.03.013","article-title":"Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice","volume":"150","author":"Peng","year":"2016","journal-title":"Comput. Vis. Image Underst."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"817","DOI":"10.1109\/TCYB.2013.2273174","article-title":"Spatio-temporal Laplacian pyramid coding for action recognition","volume":"44","author":"Shao","year":"2014","journal-title":"IEEE Trans. Cybern."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. (2014, January 24\u201327). Large-scale video classification with convolutional neural networks. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.223"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Ran, L., Zhang, Y., Wei, W., and Zhang, Q. (2017). A hyperspectral image classification framework with spatial pixel pair features. Sensors, 17.","DOI":"10.3390\/s17102421"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Ran, L., Zhang, Y., Zhang, Q., and Yang, T. (2017). Convolutional neural network-based robot navigation using uncalibrated spherical images. Sensors, 17.","DOI":"10.3390\/s17061341"},{"key":"ref_33","unstructured":"Wang, J., Liu, Z., Wu, Y., and Yuan, J. (2012, January 16\u201321). Mining actionlet ensemble for action recognition with depth cameras. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA."},{"key":"ref_34","unstructured":"Du, Y., Wang, W., and Wang, L. (2015, January 7\u201312). Hierarchical recurrent neural network for skeleton based action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Zhang, Q., and Hua, G. (2015, January 26\u201330). Multi-view visual recognition of imperfect testing data. Proceedings of the ACM International Conference on Multimedia, Brisbane, Australia.","DOI":"10.1145\/2733373.2806224"},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"633","DOI":"10.3390\/s18020633","article-title":"Exploring 3D human action recognition: From offline to online","volume":"18","author":"Liu","year":"2018","journal-title":"Sensors"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Hachaj, T., Piekarczyk, M., and Ogiela, M.R. (2017). Human actions analysis: templates generation, matching and visualization applied to motion capture of highly-skilled karate athletes. 
Sensors, 17.","DOI":"10.3390\/s17112590"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Zhang, Q., Hua, G., Liu, W., Liu, Z., and Zhang, Z. (2014, January 1\u20135). Can visual recognition benefit from auxiliary information in training?. Proceedings of the Asian Conference on Computer Vision, Singapore.","DOI":"10.1007\/978-3-319-16865-4_5"},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"138","DOI":"10.2197\/ipsjtcva.7.138","article-title":"Auxiliary training information assisted visual recognition","volume":"7","author":"Zhang","year":"2015","journal-title":"IPSJ Trans. Comput. Vis. Appl."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Sun, L., Jia, K., Yeung, D.Y., and Shi, B.E. (2015, January 13\u201316). Human action recognition using factorized spatio-temporal convolutional networks. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.522"},{"key":"ref_41","unstructured":"Srivastava, N., Mansimov, E., and Salakhudinov, R. (2015, January 6\u201311). Unsupervised learning of video representations using lstms. Proceedings of the International Conference on Machine Learning, Lille, France."},{"key":"ref_42","unstructured":"Mahasseni, B., and Todorovic, S. (July, January 26). Regularizing long short term memory with 3D human-skeleton sequences for action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Wang, L., Qiao, Y., and Tang, X. (2015, January 7\u201312). Action recognition with trajectory-pooled deep-convolutional descriptors. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7299059"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Liu, Z., Wang, L., and Zheng, N. (2018, January 25\u201327). Content-aware attention network for action recognition. 
Proceedings of the International Conference on Artificial Intelligence Applications and Innovations, Rhodes, Greece.","DOI":"10.1007\/978-3-319-92007-8_10"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Yao, L., Torabi, A., Cho, K., Ballas, N., Pal, C., Larochelle, H., and Courville, A. (2015, January 13\u201316). Describing videos by exploiting temporal structure. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.512"},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"2782","DOI":"10.1109\/TPAMI.2013.65","article-title":"Temporal localization of actions with actoms","volume":"35","author":"Gaidon","year":"2013","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Kataoka, H., Satoh, Y., Aoki, Y., Oikawa, S., and Matsui, Y. (2018). Temporal and fine-grained pedestrian action recognition on driving recorder database. Sensors, 18.","DOI":"10.3390\/s18020627"},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"1510","DOI":"10.1109\/TPAMI.2017.2712608","article-title":"Long-term temporal convolutions for action recognition","volume":"40","author":"Varol","year":"2017","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_49","unstructured":"Zhu, W., Hu, J., Sun, G., Cao, X., and Qiao, Y. (July, January 26). A key volume mining deep framework for action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA."},{"key":"ref_50","doi-asserted-by":"crossref","first-page":"254","DOI":"10.1007\/s11263-015-0859-0","article-title":"MoFAP: A multi-level representation for action recognition","volume":"119","author":"Wang","year":"2016","journal-title":"Int. J. Comput. Vis."},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Fernando, B., Gavves, S., Mogrovejo, O., Antonio, J., Ghodrati, A., and Tuytelaars, T. (2015, January 7\u201312). 
Modeling video evolution for action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7299176"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Ni, B., Moulin, P., Yang, X., and Yan, S. (2015, January 7\u201312). Motion part regularization: Improving action recognition via trajectory selection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298993"},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Zhang, Q., Abeida, H., Xue, M., Rowe, W., and Li, J. (2011, January 6\u20139). Fast implementation of sparse iterative covariance-based estimation for array processing. Proceedings of the Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA.","DOI":"10.1109\/ACSSC.2011.6190383"},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., and Fei-Fei, L. (2009, January 20\u201325). Imagenet: A large-scale hierarchical image database. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA.","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"ref_55","unstructured":"Ioffe, S., and Szegedy, C. (2015, January 6\u201311). Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the International Conference on Machine Learning, Lille, France."},{"key":"ref_56","unstructured":"Paszke, A., Gross, S., Chintala, S., and Chanan, G. (2017, January 28). Pytorch. Available online: https:\/\/github.com\/pytorch\/pytorch."},{"key":"ref_57","doi-asserted-by":"crossref","unstructured":"Cai, Z., Wang, L., Peng, X., and Qiao, Y. (2014, January 24\u201327). Multi-view super vector for action recognition. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, Ohio.","DOI":"10.1109\/CVPR.2014.83"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/18\/7\/1979\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T15:09:32Z","timestamp":1760195372000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/18\/7\/1979"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,6,21]]},"references-count":57,"journal-issue":{"issue":"7","published-online":{"date-parts":[[2018,7]]}},"alternative-id":["s18071979"],"URL":"https:\/\/doi.org\/10.3390\/s18071979","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2018,6,21]]}}}