{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,22]],"date-time":"2026-01-22T22:30:30Z","timestamp":1769121030385,"version":"3.49.0"},"reference-count":45,"publisher":"MDPI AG","issue":"7","license":[{"start":{"date-parts":[[2020,7,15]],"date-time":"2020-07-15T00:00:00Z","timestamp":1594771200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Key R&D Program of Guangdong Province","award":["2018B010107005"],"award-info":[{"award-number":["2018B010107005"]}]},{"name":"the Natural Science Foundation of Guangdong Province","award":["2016A030313288"],"award-info":[{"award-number":["2016A030313288"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Algorithms"],"abstract":"<jats:p>Modeling spatiotemporal representations is one of the most essential yet challenging issues in video action recognition. Existing methods lack the capacity to accurately model either the correlations between spatial and temporal features or the global temporal dependencies. Inspired by the two-stream network for video action recognition, we propose an encoder\u2013decoder framework named Two-Stream Bidirectional Long Short-Term Memory (LSTM) Residual Network (TBRNet) which takes advantage of the interaction between spatiotemporal representations and global temporal dependencies. In the encoding phase, the two-stream architecture, based on the proposed Residual Convolutional 3D (Res-C3D) network, extracts features with residual connections inserted between the two pathways, and then the features are fused to become the short-term spatiotemporal features of the encoder. In the decoding phase, those short-term spatiotemporal features are first fed into a temporal attention-based bidirectional LSTM (BiLSTM) network to obtain long-term bidirectional attention-pooling dependencies. 
Subsequently, those temporal dependencies are integrated with short-term spatiotemporal features to obtain global spatiotemporal relationships. On two benchmark datasets, UCF101 and HMDB51, we verified the effectiveness of our proposed TBRNet by a series of experiments, and it achieved competitive or even better results compared with existing state-of-the-art approaches.<\/jats:p>","DOI":"10.3390\/a13070169","type":"journal-article","created":{"date-parts":[[2020,7,15]],"date-time":"2020-07-15T10:35:18Z","timestamp":1594809318000},"page":"169","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":13,"title":["TBRNet: Two-Stream BiLSTM Residual Network for Video Action Recognition"],"prefix":"10.3390","volume":"13","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9153-9065","authenticated-orcid":false,"given":"Xiao","family":"Wu","sequence":"first","affiliation":[{"name":"School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510006, China"},{"name":"Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou 510006, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1207-2410","authenticated-orcid":false,"given":"Qingge","family":"Ji","sequence":"additional","affiliation":[{"name":"School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510006, China"},{"name":"Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou 510006, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2020,7,15]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"1155","DOI":"10.1109\/ACCESS.2017.2778011","article-title":"Action recognition in video sequences using deep bi-directional LSTM with CNN features","volume":"6","author":"Ullah","year":"2017","journal-title":"IEEE Access"},{"key":"ref_2","unstructured":"Simonyan, K., and Zisserman, A. 
(2014, January 8\u201313). Two-stream convolutional networks for action recognition in videos. Proceedings of the Advances in Neural Information Processing Systems, Montreal, Canada."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Feichtenhofer, C., Pinz, A., and Zisserman, A. (2016, January 27\u201330). Convolutional two-stream network fusion for video action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.213"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. (2015, January 11\u201318). Learning spatiotemporal features with 3d convolutional networks. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.510"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., and Toderici, G. (2015, January 7\u201312). Beyond short snippets: Deep networks for video classification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7299101"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., and Darrell, T. (2015, January 7\u201312). Long-term recurrent convolutional networks for visual recognition and description. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298878"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., and Rabinovich, A. (2015, January 7\u201312). Going deeper with convolutions. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A.A. (2017, January 4\u20139). Inception-v4, inception-resnet and the impact of residual connections on learning. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA.","DOI":"10.1609\/aaai.v31i1.11231"},{"key":"ref_9","unstructured":"Krizhevsky, A., Sutskever, I., and Hinton, G.E. (2012, January 3\u20136). Imagenet classification with deep convolutional neural networks. Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, CA, USA."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014, January 23\u201328). Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.81"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. (2014, January 23\u201328). Large-scale video classification with convolutional neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.223"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"221","DOI":"10.1109\/TPAMI.2012.59","article-title":"3D convolutional neural networks for human action recognition","volume":"35","author":"Ji","year":"2012","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Scovanner, P., Ali, S., and Shah, M. (2007, January 24\u201329). A 3-dimensional sift descriptor and its application to action recognition. 
Proceedings of the 15th ACM International Conference on Multimedia, Augsburg, Germany.","DOI":"10.1145\/1291233.1291311"},{"key":"ref_14","unstructured":"Hu, Y., Cao, L., Lv, F., Yan, S., Gong, Y., and Huang, T.S. (October, January 27). Action detection in complex scenes with spatial and temporal ambiguities. Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan."},{"key":"ref_15","unstructured":"Dalal, N., Triggs, B., and Schmid, C. (2016, January 7\u201313). Human detection using oriented histograms of flow and appearance. Proceedings of the European Conference on Computer Vision, Graz, Austria."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Laptev, I., Marszalek, M., Schmid, C., and Rozenfeld, B. (2008, January 24\u201326). Learning realistic human actions from movies. Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA.","DOI":"10.1109\/CVPR.2008.4587756"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Wang, H., and Schmid, C. (2013, January 1\u20138). Action recognition with improved trajectories. Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia.","DOI":"10.1109\/ICCV.2013.441"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"4","DOI":"10.1016\/j.patrec.2017.06.010","article-title":"Aggregating the temporal coherent descriptors in videos using multiple learning kernel for action recognition","volume":"105","author":"Saleh","year":"2018","journal-title":"Pattern Recognit. Lett."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., and Van Gool, L. (2016, January 11\u201314). Temporal segment networks: Towards good practices for deep action recognition. 
Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46484-8_2"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"773","DOI":"10.1109\/TPAMI.2016.2558148","article-title":"Rank pooling for action recognition","volume":"39","author":"Fernando","year":"2016","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Carreira, J., and Zisserman, A. (2017, January 21\u201326). Quo vadis, action recognition? A new model and the kinetics dataset. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.502"},{"key":"ref_22","unstructured":"Soomro, K., Zamir, A.R., and Shah, M. (2012, December 03). UCF101: A dataset of 101 Human Actions Classes from Videos in the Wild. Available online: https:\/\/arxiv.org\/pdf\/1212.0402."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., and Serre, T. (2011, January 6\u201313). HMDB: A large video database for human motion recognition. Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain.","DOI":"10.1109\/ICCV.2011.6126543"},{"key":"ref_24","unstructured":"Wang, H., Kl\u00e4ser, A., Schmid, C., and Liu, C.L. (2011, January 20\u201325). Action recognition by dense trajectories. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA."},{"key":"ref_25","unstructured":"Lazebnik, S., Schmid, C., and Ponce, J. (2006, January 17\u201322). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New York, NY, USA."},{"key":"ref_26","unstructured":"Tran, D., Ray, J., Shou, Z., Chang, S.F., and Paluri, M. (2017, August 16). 
Convnet Architecture Search for Spatiotemporal Feature Learning. Available online: https:\/\/arxiv.org\/pdf\/1708.05038."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., and Saenko, K. (2015, January 7\u201312). Sequence to sequence-video to text. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/ICCV.2015.515"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"1510","DOI":"10.1109\/TPAMI.2017.2712608","article-title":"Long-term temporal convolutions for action recognition","volume":"40","author":"Varol","year":"2017","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"ref_29","unstructured":"Ba, J., Mnih, V., and Kavukcuoglu, K. (2015, April 23). Multiple Object Recognition with Visual Attention. Available online: https:\/\/arxiv.org\/pdf\/1412.7755."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Hu, J., Shen, L., and Sun, G. (2018, January 18\u201322). Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00745"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Wang, F., Jiang, M., Qian, C., Yang, S., Li, C., Zhang, H., and Tang, X. (2017, January 21\u201326). Residual attention network for image classification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.683"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Kar, A., Rai, N., Sikka, K., and Sharma, G. (2017, January 21\u201326). Adascan: Adaptive scan pooling in deep convolutional neural networks for human action recognition in videos. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.604"},{"key":"ref_33","unstructured":"Korbar, B., Tran, D., and Torresani, L. (November, January 27). SCSampler: Sampling salient clips from video for efficient action recognition. Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"118388","DOI":"10.1109\/ACCESS.2019.2936628","article-title":"Two-Level Attention Model Based Video Action Recognition Network","volume":"7","author":"Sang","year":"2019","journal-title":"IEEE Access"},{"key":"ref_35","unstructured":"Sharma, S., Kiros, R., and Salakhutdinov, R. (2016, February 14). Action Recognition Using Visual Attention. Available online: https:\/\/arxiv.org\/pdf\/1511.04119."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"226","DOI":"10.1016\/j.patrec.2018.07.034","article-title":"Joint spatial-temporal attention for action recognition","volume":"112","author":"Yu","year":"2018","journal-title":"Pattern Recognit. Lett."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Feichtenhofer, C., Pinz, A., and Wildes, R.P. (2016, January 5\u201310). Spatiotemporal residual networks for video action recognition. Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain.","DOI":"10.1109\/CVPR.2017.787"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Feichtenhofer, C., Pinz, A., and Wildes, R.P. (2017, January 21\u201326). Spatiotemporal multiplier networks for video action recognition. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.787"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Qiu, Z., Yao, T., and Mei, T. (2017, January 22\u201329). Learning spatio-temporal representation with pseudo-3d residual networks. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.590"},{"key":"ref_41","unstructured":"Simonyan, K., and Zisserman, A. (2015, April 10). Very Deep Convolutional Networks for Large-Scale Image Recognition. Available online: https:\/\/arxiv.org\/pdf\/1409.1556."},{"key":"ref_42","unstructured":"Srivastava, R.K., Greff, K., and Schmidhuber, J. (2015, January 7\u201312). Training very deep networks. Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada."},{"key":"ref_43","unstructured":"Srivastava, R.K., Greff, K., and Schmidhuber, J. (2015, November 03). Highway Networks. Available online: https:\/\/arxiv.org\/pdf\/1505.00387."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Wang, L., Qiao, Y., and Tang, X. (2015, January 7\u201312). Action recognition with trajectory-pooled deep-convolutional descriptors. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7299059"},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"375","DOI":"10.1007\/s11263-017-1013-y","article-title":"Every moment counts: Dense detailed labeling of actions in complex videos","volume":"126","author":"Yeung","year":"2018","journal-title":"Int. J. Comput. 
Vis."}],"container-title":["Algorithms"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-4893\/13\/7\/169\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T09:51:34Z","timestamp":1760176294000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-4893\/13\/7\/169"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,7,15]]},"references-count":45,"journal-issue":{"issue":"7","published-online":{"date-parts":[[2020,7]]}},"alternative-id":["a13070169"],"URL":"https:\/\/doi.org\/10.3390\/a13070169","relation":{},"ISSN":["1999-4893"],"issn-type":[{"value":"1999-4893","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,7,15]]}}}