{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,17]],"date-time":"2025-10-17T14:17:13Z","timestamp":1760710633068,"version":"build-2065373602"},"reference-count":81,"publisher":"MDPI AG","issue":"14","license":[{"start":{"date-parts":[[2021,7,10]],"date-time":"2021-07-10T00:00:00Z","timestamp":1625875200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Human action recognition methods in videos based on deep convolutional neural networks usually use random cropping or its variants for data augmentation. However, this traditional data augmentation approach may generate many non-informative samples (video patches covering only a small part of the foreground or only the background) that are not related to a specific action. These samples can be regarded as noisy samples with incorrect labels, which reduces the overall action recognition performance. In this paper, we attempt to mitigate the impact of noisy samples by proposing an Auto-augmented Siamese Neural Network (ASNet). In this framework, we propose backpropagating salient patches and randomly cropped samples in the same iteration to perform gradient compensation to alleviate the adverse gradient effects of non-informative samples. Salient patches refer to the samples containing critical information for human action recognition. The generation of salient patches is formulated as a Markov decision process, and a reinforcement learning agent called SPA (Salient Patch Agent) is introduced to extract patches in a weakly supervised manner without extra labels. 
Extensive experiments were conducted on two well-known datasets UCF-101 and HMDB-51 to verify the effectiveness of the proposed SPA and ASNet.<\/jats:p>","DOI":"10.3390\/s21144720","type":"journal-article","created":{"date-parts":[[2021,7,11]],"date-time":"2021-07-11T22:16:48Z","timestamp":1626041808000},"page":"4720","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["ASNet: Auto-Augmented Siamese Neural Network for Action Recognition"],"prefix":"10.3390","volume":"21","author":[{"given":"Yujia","family":"Zhang","sequence":"first","affiliation":[{"name":"Department of Electrical Engineering, City University of Hong Kong, Hong Kong, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Lai-Man","family":"Po","sequence":"additional","affiliation":[{"name":"Department of Electrical Engineering, City University of Hong Kong, Hong Kong, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5288-3605","authenticated-orcid":false,"given":"Jingjing","family":"Xiong","sequence":"additional","affiliation":[{"name":"Department of Electrical Engineering, City University of Hong Kong, Hong Kong, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yasar Abbas Ur","family":"REHMAN","sequence":"additional","affiliation":[{"name":"TCL Corporate Research Co. 
Limited, Hong Kong, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9586-2812","authenticated-orcid":false,"given":"Kwok-Wai","family":"Cheung","sequence":"additional","affiliation":[{"name":"School of Communication, The Hang Seng University of Hong Kong, Hong Kong, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2021,7,10]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"480","DOI":"10.1016\/j.eswa.2017.09.029","article-title":"Abnormal behavior recognition for intelligent video surveillance systems: A review","volume":"91","author":"Mabrouk","year":"2018","journal-title":"Expert Syst. Appl."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"1550147716665520","DOI":"10.1177\/1550147716665520","article-title":"A review on applications of activity recognition systems with regard to performance and evaluation","volume":"12","author":"Ranasinghe","year":"2016","journal-title":"Int. J. Distrib. Sens. Netw."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"1473","DOI":"10.1109\/TCSVT.2008.2005594","article-title":"Machine recognition of human activities: A survey","volume":"18","author":"Turaga","year":"2008","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"102846","DOI":"10.1016\/j.jvcir.2020.102846","article-title":"Spatial-temporal saliency action mask attention network for action recognition","volume":"71","author":"Jiang","year":"2020","journal-title":"J. Vis. Commun. Image Represent."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Zuo, Q., Zou, L., Fan, C., Li, D., Jiang, H., and Liu, Y. (2020). Whole and Part Adaptive Fusion Graph Convolutional Networks for Skeleton-Based Action Recognition. 
Sensors, 20.","DOI":"10.3390\/s20247149"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Carreira, J., and Zisserman, A. (2017, January 21\u201326). Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.502"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Crasto, N., Weinzaepfel, P., Alahari, K., and Schmid, C. (2019, January 16\u201320). Mars: Motion-Augmented Rgb Stream for Action Recognition. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00807"},{"key":"ref_8","unstructured":"Feichtenhofer, C., Fan, H., Malik, J., and He, K. (November, January 27). Slowfast networks for video recognition. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Korea."},{"key":"ref_9","unstructured":"Fan, Q., Chen, C.-F., Kuehne, H., Pistoia, M., and Cox, D. (2019). More is less: Learning efficient video representations by big-little network and depthwise temporal aggregation. arXiv."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., and Van Gool, L. (2016). Temporal segment networks: Towards good practices for deep action recognition. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-319-46484-8_2"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Hara, K., Kataoka, H., and Satoh, Y. Can Spatiotemporal 3d Cnns Retrace the History of 2d Cnns and Imagenet? In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18\u201323 June 2018.","DOI":"10.1109\/CVPR.2018.00685"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Dong, M., Fang, Z., Li, Y., Bi, S., and Chen, J. (2021). AR3D: Attention Residual 3D Network for Human Action Recognition. 
Sensors, 21.","DOI":"10.3390\/s21051656"},{"key":"ref_13","unstructured":"Soomro, K., Zamir, A.R., and Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Goyal, R., Ebrahimi Kahou, S., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., and Mueller-Freitag, M. (2017, January 22\u201329). The \u201csomething something\u201d video database for learning and evaluating visual common sense. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.622"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., and Serre, T. (2011, January 6\u201313). HMDB: A large video database for human motion recognition. Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain.","DOI":"10.1109\/ICCV.2011.6126543"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Sun, S., Lei, L., Liu, H., and Xie, H. (2021). STAC: Spatial-Temporal Attention on Compensation Information for Activity Recognition in FPV. Sensors, 21.","DOI":"10.3390\/s21041106"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. (2015, January 7\u201313). Learning spatiotemporal features with 3d convolutional networks. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.510"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., and Paluri, M. (2018, January 18\u201323). A closer look at spatiotemporal convolutions for action recognition. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00675"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Feichtenhofer, C. (2020, January 13\u201319). X3d: Expanding architectures for efficient video recognition. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00028"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Li, Y., Ji, B., Shi, X., Zhang, J., Kang, B., and Wang, L. (2020, January 13\u201319). Tea: Temporal excitation and aggregation for action recognition. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00099"},{"key":"ref_21","unstructured":"Zhang, S., Guo, S., Huang, W., Scott, M.R., and Wang, L. (2020). V4D: 4D Convolutional neural networks for video-level representation learning. arXiv."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Li, X., Shuai, B., and Tighe, J. (2020). Directional temporal modeling for action recognition. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-030-58539-6_17"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Bekker, A.J., and Goldberger, J. (2016, January 20\u201325). Training Deep Neural-Networks Based on Unreliable Labels. Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China.","DOI":"10.1109\/ICASSP.2016.7472164"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Patrini, G., Rozza, A., Krishna Menon, A., Nock, R., and Qu, L. (2017, January 21\u201326). Making deep neural networks robust to label noise: A loss correction approach. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.240"},{"key":"ref_25","unstructured":"Rolnick, D., Veit, A., Belongie, S., and Shavit, N. (2017). Deep learning is robust to massive label noise. arXiv."},{"key":"ref_26","unstructured":"Song, H., Kim, M., Park, D., Shin, Y., and Lee, J.-G. (2020). Learning from noisy labels with deep neural networks: A survey. arXiv."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Tran, D., Wang, H., Torresani, L., and Feiszli, M. (2019, January 27\u201328). Video classification with channel-separated convolutional networks. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00565"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Liu, J., Luo, J., and Shah, M. (2009, January 20\u201325). Recognizing realistic actions from videos \u201cin the wild\u201d. Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA.","DOI":"10.1109\/CVPR.2009.5206744"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Niebles, J.C., Chen, C.-W., and Fei-Fei, L. (2010). Modeling temporal structure of decomposable motion segments for activity classification. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-642-15552-9_29"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Wang, H., and Schmid, C. (2013, January 1\u20138). Action recognition with improved trajectories. Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia.","DOI":"10.1109\/ICCV.2013.441"},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"60","DOI":"10.1007\/s11263-012-0594-8","article-title":"Dense trajectories and motion boundary descriptors for action recognition","volume":"103","author":"Wang","year":"2013","journal-title":"Int. J. Comput. 
Vis."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Kantorov, V., and Laptev, I. (2014, January 23\u201328). Efficient feature extraction, encoding and classification for action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.332"},{"key":"ref_33","unstructured":"Simonyan, K., and Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. arXiv."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., and Darrell, T. (2015, January 7\u201312). Long-term recurrent convolutional networks for visual recognition and description. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298878"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Wang, X., Girshick, R., Gupta, A., and He, K. (2018, January 18\u201322). Non-local neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00813"},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"221","DOI":"10.1109\/TPAMI.2012.59","article-title":"3D convolutional neural networks for human action recognition","volume":"35","author":"Ji","year":"2012","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_37","unstructured":"Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. (2018, January 23\u201328). Large-scale video classification with convolutional neural networks. Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Columbus, OH, USA."},{"key":"ref_38","unstructured":"Simonyan, K., and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. 
arXiv."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Qiu, Z., Yao, T., and Mei, T. (2017, January 22\u201329). Learning spatio-temporal representation with pseudo-3d residual networks. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.590"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Zhou, B., Andonian, A., Oliva, A., and Torralba, A. (2018, January 8\u201314). Temporal relational reasoning in videos. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01246-5_49"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Xie, S., Sun, C., Huang, J., Tu, Z., and Murphy, K. (2018, January 8\u201314). Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01267-0_19"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Lin, J., Gan, C., and Han, S. (November, January 27). Tsm: Temporal shift module for efficient video understanding. Proceedings of the IEEE\/CVF International Conference on Computer Vision, 2019, Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00718"},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"2278","DOI":"10.1109\/5.726791","article-title":"Gradient-based learning applied to document recognition","volume":"86","author":"LeCun","year":"1998","journal-title":"Proc. IEEE"},{"key":"ref_44","unstructured":"Bengio, Y., Bastien, F., Bergeron, A., Boulanger\u2013Lewandowski, N., Breuel, T., Chherawala, Y., Cisse, M., C\u00f4t\u00e9, M., Erhan, D., and Eustache, J. (2011, January 11\u201313). Deep learners benefit more from out-of-distribution examples. 
Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA."},{"key":"ref_45","first-page":"1097","article-title":"Imagenet classification with deep convolutional neural networks","volume":"25","author":"Krizhevsky","year":"2012","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"5858","DOI":"10.1109\/ACCESS.2017.2696121","article-title":"Smart augmentation learning an optimal data augmentation strategy","volume":"5","author":"Lemley","year":"2017","journal-title":"IEEE Access"},{"key":"ref_47","unstructured":"DeVries, T., and Taylor, G.W. (2017). Improved regularization of convolutional neural networks with cutout. arXiv."},{"key":"ref_48","unstructured":"Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., and Yoo, Y. (November, January 27). Cutmix: Regularization strategy to train strong classifiers with localizable features. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Korea."},{"key":"ref_49","unstructured":"Uddin, A., Monira, M., Shin, W., Chung, T., and Bae, S.-H. (2020). SaliencyMix: A Saliency Guided Data Augmentation Strategy for Better Regularization. arXiv."},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Gong, C., Wang, D., Li, M., Chandra, V., and Liu, Q. (2020). KeepAugment: A Simple Information-Preserving Data Augmentation Approach. arXiv.","DOI":"10.1109\/CVPR46437.2021.00111"},{"key":"ref_51","doi-asserted-by":"crossref","first-page":"375","DOI":"10.1016\/j.jvcir.2016.10.016","article-title":"Spatio-temporal action localization and detection for human action recognition in big dataset","volume":"41","author":"Megrhi","year":"2016","journal-title":"J. Vis. Commun. 
Image Represent."},{"key":"ref_52","doi-asserted-by":"crossref","first-page":"82","DOI":"10.1016\/j.neucom.2016.09.106","article-title":"Action recognition by saliency-based dense sampling","volume":"236","author":"Xu","year":"2017","journal-title":"Neurocomputing"},{"key":"ref_53","doi-asserted-by":"crossref","first-page":"32","DOI":"10.1016\/j.patcog.2018.01.020","article-title":"Multi-stream CNN: Learning representations based on human-related regions for action recognition","volume":"79","author":"Tu","year":"2018","journal-title":"Pattern Recognit."},{"key":"ref_54","doi-asserted-by":"crossref","first-page":"113203","DOI":"10.1016\/j.eswa.2020.113203","article-title":"Data-level information enhancement: Motion-patch-based Siamese Convolutional Neural Networks for human activity recognition in videos","volume":"147","author":"Zhang","year":"2020","journal-title":"Expert Syst. Appl."},{"key":"ref_55","doi-asserted-by":"crossref","first-page":"1423","DOI":"10.1109\/TCSVT.2018.2830102","article-title":"Semantic cues enhanced multimodality multistream CNN for action recognition","volume":"29","author":"Tu","year":"2018","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_56","doi-asserted-by":"crossref","first-page":"8890808","DOI":"10.1155\/2021\/8890808","article-title":"Attention-Based Temporal Encoding Network with Background-Independent Motion Mask for Action Recognition","volume":"2021","author":"Weng","year":"2021","journal-title":"Comput. Intell. Neurosci."},{"key":"ref_57","doi-asserted-by":"crossref","unstructured":"Pirinen, A., and Sminchisescu, C. (2018, January 18\u201323). Deep reinforcement learning of region proposal networks for object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00726"},{"key":"ref_58","doi-asserted-by":"crossref","unstructured":"Ren, L., Lu, J., Wang, Z., Tian, Q., and Zhou, J. (2018, January 8\u201314). 
Collaborative deep reinforcement learning for multi-object tracking. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01219-9_36"},{"key":"ref_59","doi-asserted-by":"crossref","unstructured":"Li, D., Wu, H., Zhang, J., and Huang, K. (2018, January 18\u201323). A2-RL: Aesthetics aware reinforcement learning for image cropping. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00855"},{"key":"ref_60","unstructured":"Huang, Z., Heng, W., and Zhou, S. (November, January 27). Learning to paint with model-based deep reinforcement learning. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Korea."},{"key":"ref_61","doi-asserted-by":"crossref","unstructured":"Han, J., Yang, L., Zhang, D., Chang, X., and Liang, X. (2018, January 18\u201323). Reinforcement cutting-agent learning for video object segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00946"},{"key":"ref_62","unstructured":"Dong, W., Zhang, Z., and Tan, T. (February, January 27). Attention-aware sampling via deep reinforcement learning for action recognition. Proceedings of the AAAI Conference on Artificial Intelligence, Palo Alto, CA, USA."},{"key":"ref_63","unstructured":"Wu, W., He, D., Tan, X., Chen, S., and Wen, S. (November, January 27). Multi-agent reinforcement learning based frame sampling for effective untrimmed video recognition. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Korea."},{"key":"ref_64","doi-asserted-by":"crossref","first-page":"7970","DOI":"10.1109\/TIP.2020.3007826","article-title":"Dynamic sampling networks for efficient action recognition in videos","volume":"29","author":"Zheng","year":"2020","journal-title":"IEEE Trans. 
Image Process."},{"key":"ref_65","doi-asserted-by":"crossref","unstructured":"Meng, Y., Lin, C.-C., Panda, R., Sattigeri, P., Karlinsky, L., Oliva, A., Saenko, K., and Feris, R. (2020). Ar-net: Adaptive frame resolution for efficient action recognition. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-030-58571-6_6"},{"key":"ref_66","unstructured":"Tang, Y., and Agrawal, S. (2020, January 7\u201312). Discretizing continuous action space for on-policy optimization. Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA."},{"key":"ref_67","doi-asserted-by":"crossref","first-page":"106886","DOI":"10.1016\/j.compchemeng.2020.106886","article-title":"A review on reinforcement learning: Introduction and applications in industrial process control","volume":"139","author":"Nian","year":"2020","journal-title":"Comput. Chem. Eng."},{"key":"ref_68","doi-asserted-by":"crossref","unstructured":"Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. (2018, January 2\u20137). Deep reinforcement learning that matters. Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA.","DOI":"10.1609\/aaai.v32i1.11694"},{"key":"ref_69","unstructured":"Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv."},{"key":"ref_70","unstructured":"Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016, January 20\u201322). Asynchronous methods for deep reinforcement learning. In Proceeding of the International Conference on Machine Learning, New York, NY, USA."},{"key":"ref_71","unstructured":"Ioffe, S., and Szegedy, C. (2015, January 6\u201311). Batch normalization: Accelerating deep network training by reducing internal covariate shift. 
Proceedings of the International Conference on Machine Learning, Lille, France."},{"key":"ref_72","doi-asserted-by":"crossref","unstructured":"Feichtenhofer, C., Pinz, A., and Zisserman, A. (2016, January 27\u201330). Convolutional two-stream network fusion for video action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.213"},{"key":"ref_73","doi-asserted-by":"crossref","unstructured":"Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. (2017, January 22\u201329). Grad-cam: Visual explanations from deep networks via gradient-based localization. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.74"},{"key":"ref_74","unstructured":"Tran, D., Ray, J., Shou, Z., Chang, S.-F., and Paluri, M. (2017). Convnet architecture search for spatiotemporal feature learning. arXiv."},{"key":"ref_75","doi-asserted-by":"crossref","unstructured":"Wu, H., Liu, J., Zha, Z.-J., Chen, Z., and Sun, X. (2019, January 10\u201316). Mutually Reinforced Spatio-Temporal Convolutional Tube for Human Action Recognition. Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), Macao, China.","DOI":"10.24963\/ijcai.2019\/136"},{"key":"ref_76","unstructured":"He, D., Zhou, Z., Gan, C., Li, F., Liu, X., Li, Y., Wang, L., and Wen, S. (February, January 27). Stnet: Local and global spatial-temporal modeling for action recognition. Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA."},{"key":"ref_77","doi-asserted-by":"crossref","first-page":"14593","DOI":"10.1007\/s00521-020-05144-7","article-title":"Spatiotemporal saliency-based multi-stream networks with attention-aware LSTM for action recognition","volume":"32","author":"Liu","year":"2020","journal-title":"Neural Comput. 
Appl."},{"key":"ref_78","doi-asserted-by":"crossref","first-page":"1347","DOI":"10.1109\/TIP.2017.2778563","article-title":"Recurrent spatial-temporal attention network for action recognition in videos","volume":"27","author":"Du","year":"2017","journal-title":"IEEE Trans. Image Process."},{"key":"ref_79","unstructured":"Jiang, B., Wang, M., Gan, W., Wu, W., and Yan, J. (November, January 27). Stm: Spatiotemporal and motion encoding for action recognition. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Korea."},{"key":"ref_80","doi-asserted-by":"crossref","unstructured":"Liu, Z., Luo, D., Wang, Y., Wang, L., Tai, Y., Wang, C., Li, J., Huang, F., and Lu, T. (2020, January 7\u201312). Teinet: Towards an efficient architecture for video recognition. Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA.","DOI":"10.1609\/aaai.v34i07.6836"},{"key":"ref_81","doi-asserted-by":"crossref","unstructured":"Zhou, Y., Sun, X., Luo, C., Zha, Z.-J., and Zeng, W. (2020, January 13\u201319). Spatiotemporal fusion in 3D CNNs: A probabilistic view. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00985"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/14\/4720\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T06:28:40Z","timestamp":1760164120000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/14\/4720"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7,10]]},"references-count":81,"journal-issue":{"issue":"14","published-online":{"date-parts":[[2021,7]]}},"alternative-id":["s21144720"],"URL":"https:\/\/doi.org\/10.3390\/s21144720","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2021,7,10]]}}}