{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T03:03:15Z","timestamp":1760238195006,"version":"build-2065373602"},"reference-count":43,"publisher":"MDPI AG","issue":"14","license":[{"start":{"date-parts":[[2020,7,13]],"date-time":"2020-07-13T00:00:00Z","timestamp":1594598400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100010418","name":"Institute for Information and Communications Technology Promotion","doi-asserted-by":"publisher","award":["2014000077","2017000250"],"award-info":[{"award-number":["2014000077","2017000250"]}],"id":[{"id":"10.13039\/501100010418","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Various action recognition approaches have recently been proposed with the aid of three-dimensional (3D) convolution and a multiple stream structure. However, existing methods are sensitive to background and optical flow noise, which prevents the network from learning the main object in a video frame. Furthermore, they cannot reflect the accuracy of each stream in the process of combining multiple streams. In this paper, we present a novel action recognition method that improves the existing method using optical flow and a multi-stream structure. The proposed method consists of two parts: (i) an optical flow enhancement process using image segmentation and (ii) a score fusion process that applies a weighted sum of the accuracy of each stream. The enhancement process can help the network to efficiently analyze the flow information of the main object in the optical flow frame, thereby improving accuracy. The different accuracy of each stream can be reflected in the fused score using the proposed score fusion method. We achieved an accuracy of 98.2% on UCF-101 and 82.4% on HMDB-51. 
The proposed method outperformed many state-of-the-art methods without changing the network structure, and it is expected to be easily applied to other networks.<\/jats:p>","DOI":"10.3390\/s20143894","type":"journal-article","created":{"date-parts":[[2020,7,14]],"date-time":"2020-07-14T09:30:49Z","timestamp":1594719049000},"page":"3894","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Enhanced Action Recognition Using Multiple Stream Deep Learning with Optical Flow and Weighted Sum"],"prefix":"10.3390","volume":"20","author":[{"given":"Hyunwoo","family":"Kim","sequence":"first","affiliation":[{"name":"Department of Image, Graduate School of Advanced Imaging Science, Multimedia and Film, Chung-Ang University, Seoul 06974, Korea"}]},{"given":"Seokmok","family":"Park","sequence":"additional","affiliation":[{"name":"Department of Image, Graduate School of Advanced Imaging Science, Multimedia and Film, Chung-Ang University, Seoul 06974, Korea"}]},{"given":"Hyeokjin","family":"Park","sequence":"additional","affiliation":[{"name":"Department of Image, Graduate School of Advanced Imaging Science, Multimedia and Film, Chung-Ang University, Seoul 06974, Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8593-7155","authenticated-orcid":false,"given":"Joonki","family":"Paik","sequence":"additional","affiliation":[{"name":"Department of Image, Graduate School of Advanced Imaging Science, Multimedia and Film, Chung-Ang University, Seoul 06974, Korea"}]}],"member":"1968","published-online":{"date-parts":[[2020,7,13]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. (2014, January 24\u201327). Large-scale video classification with convolutional neural networks. 
Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.223"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. (2015, January 13\u201316). Learning spatiotemporal features with 3d convolutional networks. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.510"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Zach, C., Pock, T., and Bischof, H. (2007). A duality based approach for realtime TV-L 1 optical flow. Joint Pattern Recognition Symposium, Springer.","DOI":"10.1007\/978-3-540-74936-3_22"},{"key":"ref_4","unstructured":"Simonyan, K., and Zisserman, A. (2014, January 8\u201313). Two-stream convolutional networks for action recognition in videos. Proceedings of the Conference on Neural Information Processing Systems, Montreal, QC, Canada."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Carreira, J., and Zisserman, A. (2017, January 21\u201326). Quo vadis, action recognition? a new model and the kinetics dataset. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.502"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Choutas, V., Weinzaepfel, P., Revaud, J., and Schmid, C. (2018, January 18\u201322). Potion: Pose motion representation for action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00734"},{"key":"ref_7","unstructured":"Wang, L., Koniusz, P., and Huynh, D.Q. (November, January 27). Hallucinating idt descriptors and i3d optical flow features for action recognition with cnns. 
Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Crasto, N., Weinzaepfel, P., Alahari, K., and Schmid, C. (2019, January 16\u201320). MARS: Motion-augmented RGB stream for action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00807"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Yan, A., Wang, Y., Li, Z., and Qiao, Y. (2019, January 16\u201320). PA3D: Pose-action 3D machine for video recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00811"},{"key":"ref_10","unstructured":"Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., and Natsev, P. (2017). The kinetics human action video dataset. arXiv."},{"key":"ref_11","unstructured":"Sharma, S., Kiros, R., and Salakhutdinov, R. (2015). Action recognition using visual attention. arXiv."},{"key":"ref_12","unstructured":"Girdhar, R., and Ramanan, D. (2017, January 4\u20139). Attentional pooling for action recognition. Proceedings of the Conference on Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_13","unstructured":"Chen, L.C., Papandreou, G., Schroff, F., and Adam, H. (2017). Rethinking atrous convolution for semantic image segmentation. arXiv."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Dalal, N., Triggs, B., and Schmid, C. (2006). Human detection using oriented histograms of flow and appearance. European Conference on Computer Vision, Springer.","DOI":"10.1007\/11744047_33"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Klaser, A., Marsza\u0142ek, M., and Schmid, C. (2008, January 1\u20134). A spatio-temporal descriptor based on 3d-gradients. 
Proceedings of the British Machine Vision Conference, Leeds, UK.","DOI":"10.5244\/C.22.99"},{"key":"ref_16","unstructured":"Freeman, W.T., and Roth, M. (1995, January 26\u201328). Orientation histograms for hand gesture recognition. Proceedings of the International Workshop on Automatic Face and Gesture Recognition, Zurich, Switzerland."},{"key":"ref_17","unstructured":"Feichtenhofer, C., Pinz, A., and Zisserman, A. (\u20131, January 26). Convolutional two-stream network fusion for video action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA."},{"key":"ref_18","unstructured":"Nguyen, P.X., Ramanan, D., and Fowlkes, C.C. (November, January 27). Weakly-supervised action localization with background modeling. Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea."},{"key":"ref_19","unstructured":"Liu, Z., Wang, L., Zhang, Q., Gao, Z., Niu, Z., Zheng, N., and Hua, G. (November, January 27). Weakly supervised temporal action localization through contrast based evaluation networks. Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea."},{"key":"ref_20","unstructured":"Zhang, X.Y., Li, C., Shi, H., Zhu, X., Li, P., and Dong, J. (2020). AdapNet: Adaptability decomposing encoder-decoder network for weakly supervised action recognition and localization. IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_21","unstructured":"Krizhevsky, A., Sutskever, I., and Hinton, G.E. (2012, January 3\u20138). Imagenet classification with deep convolutional neural networks. Proceedings of the Conference on in Neural Information Processing, Lake Tahoe, NV, USA."},{"key":"ref_22","unstructured":"Ioffe, S., and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., and Serre, T. 
(2011, January 6\u201313). HMDB: A large video database for human motion recognition. Proceedings of the International Conference on Computer Vision (ICCV), Barcelona, Spain.","DOI":"10.1109\/ICCV.2011.6126543"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Stroud, J., Ross, D., Sun, C., Deng, J., and Sukthankar, R. (2020, January 1\u20135). D3d: Distilled 3d networks for video action recognition. Proceedings of the The IEEE Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA.","DOI":"10.1109\/WACV45572.2020.9093274"},{"key":"ref_25","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017, January 4\u20139). Attention is all you need. Proceedings of the Conference on Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Luong, M.T., Pham, H., and Manning, C.D. (2015). Effective approaches to attention-based neural machine translation. arXiv.","DOI":"10.18653\/v1\/D15-1166"},{"key":"ref_27","unstructured":"Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. (2015, January 7\u20139). Show, attend and tell: Neural image caption generation with visual attention. Proceedings of the International Conference on Machine Learning, Lille, France."},{"key":"ref_28","unstructured":"Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"107037","DOI":"10.1016\/j.patcog.2019.107037","article-title":"Spatio-temporal deformable 3d convnets with attention for action recognition","volume":"98","author":"Li","year":"2020","journal-title":"Pattern Recogn."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Wu, Z., Jiang, Y.G., Wang, X., Ye, H., and Xue, X. (2016, January 15\u201319). 
Multi-stream multi-class fusion of deep networks for video classification. Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands.","DOI":"10.1145\/2964284.2964328"},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"342","DOI":"10.1109\/TPAMI.2007.70796","article-title":"Likelihood ratio-based biometric score fusion","volume":"30","author":"Nandakumar","year":"2007","journal-title":"IEEE Trans. Pattern Anal. Mach. Intel."},{"key":"ref_32","unstructured":"Srivastava, N., Mansimov, E., and Salakhudinov, R. (2015, January 7\u20139). Unsupervised learning of video representations using lstms. Proceedings of the International Conference on Machine Learning, Lille, France."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"1692","DOI":"10.1109\/TPAMI.2015.2461544","article-title":"Moddrop: Adaptive multi-modal gesture recognition","volume":"38","author":"Neverova","year":"2015","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_34","unstructured":"Soomro, K., Zamir, A.R., and Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"98","DOI":"10.1007\/s11263-014-0733-5","article-title":"The Pascal Visual Object Classes Challenge: A Retrospective","volume":"111","author":"Everingham","year":"2015","journal-title":"Int. J. Comput. Vis."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Qiu, Z., Yao, T., Ngo, C.W., Tian, X., and Mei, T. (2019, January 16\u201320). Learning spatio-temporal representation with local and global diffusion. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.01233"},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"16785","DOI":"10.1109\/ACCESS.2020.2968024","article-title":"Learning Attention-Enhanced Spatiotemporal Representation for Action Recognition","volume":"8","author":"Shi","year":"2020","journal-title":"IEEE Access"},{"key":"ref_38","unstructured":"Jiang, B., Wang, M., Gan, W., Wu, W., and Yan, J. (November, January 27). STM: SpatioTemporal and motion encoding for action recognition. Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea."},{"key":"ref_39","unstructured":"Chi, L., Tian, G., Mu, Y., and Tian, Q. (November, January 27). Two-Stream Video Classification with Cross-Modality Attention. Proceedings of the IEEE International Conference on Computer Vision Workshops, Seoul, Korea."},{"key":"ref_40","unstructured":"Zhu, Y., Lan, Z., Newsam, S., and Hauptmann, A. (2018). Hidden two-stream convolutional networks for action recognition. Asian Conference on Computer Vision, Springer."},{"key":"ref_41","unstructured":"Zhang, J., Shen, F., Xu, X., and Shen, H.T. (2019). Cooperative Cross-Stream Network for Discriminative Action Representation. arXiv."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Diba, A., Fayyaz, M., Sharma, V., Paluri, M., Gall, J., Stiefelhagen, R., and Van Gool, L. (2019). Holistic Large Scale Video Understanding. arXiv.","DOI":"10.1007\/978-3-030-58558-7_35"},{"key":"ref_43","unstructured":"Piergiovanni, A., Angelova, A., Toshev, A., and Ryoo, M.S. (November, January 27). Evolving space-time neural architectures for videos. 
Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/20\/14\/3894\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T09:50:56Z","timestamp":1760176256000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/20\/14\/3894"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,7,13]]},"references-count":43,"journal-issue":{"issue":"14","published-online":{"date-parts":[[2020,7]]}},"alternative-id":["s20143894"],"URL":"https:\/\/doi.org\/10.3390\/s20143894","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2020,7,13]]}}}