{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,19]],"date-time":"2026-02-19T02:16:05Z","timestamp":1771467365620,"version":"3.50.1"},"reference-count":55,"publisher":"Institute of Electronics, Information and Communications Engineers (IEICE)","issue":"11","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["IEICE Trans. Inf. & Syst."],"published-print":{"date-parts":[[2025,11,1]]},"DOI":"10.1587\/transinf.2024edp7297","type":"journal-article","created":{"date-parts":[[2025,5,13]],"date-time":"2025-05-13T18:07:34Z","timestamp":1747159654000},"page":"1359-1372","source":"Crossref","is-referenced-by-count":1,"title":["Revisiting I3D for Sign Language Recognition"],"prefix":"10.1587","volume":"E108.D","author":[{"given":"Shrey","family":"SINGH","sequence":"first","affiliation":[{"name":"Indian Institute of Technology"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Prateek","family":"KESERWANI","sequence":"additional","affiliation":[{"name":"Indian Institute of Technology"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Katsufumi","family":"INOUE","sequence":"additional","affiliation":[{"name":"Osaka Metropolitan University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Masakazu","family":"IWAMURA","sequence":"additional","affiliation":[{"name":"Osaka Metropolitan University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Partha Pratim","family":"ROY","sequence":"additional","affiliation":[{"name":"Indian Institute of Technology"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"532","reference":[{"key":"1","doi-asserted-by":"crossref","unstructured":"[1] K. Emmorey, \u201cLanguage, cognition, and the brain: Insights from sign language research,\u201d Psychology Press, 2001. 10.4324\/9781410603982","DOI":"10.4324\/9781410603982"},{"key":"2","unstructured":"[2] J. Murray, K. Kraus, E. Down, R. 
Adam, K. Snoddon, and D.J. Napoli, \u201cWorld Federation of the Deaf position paper on the language rights of deaf children,\u201d Helsinki, 2016."},{"key":"3","doi-asserted-by":"crossref","unstructured":"[3] T. Johnston and A. Schembri, \u201cAustralian Sign Language (Auslan): An introduction to sign language linguistics,\u201d Cambridge University Press, 2007. 10.1017\/cbo9780511607479","DOI":"10.1017\/CBO9780511607479"},{"key":"4","doi-asserted-by":"crossref","unstructured":"[4] Q. Yang, \u201cChinese sign language recognition based on video sequence appearance modeling,\u201d 2010 5th IEEE Conference on Industrial Electronics and Applications, pp.1537-1542, 2010. 10.1109\/iciea.2010.5514688","DOI":"10.1109\/ICIEA.2010.5514688"},{"key":"5","unstructured":"[5] A. Mindess, \u201cIntercultural communication for sign language interpreters,\u201d A celebration of the profession: Proceedings of the Fourteenth National Convention of the Registry of Interpreters for the Deaf, pp.69-83, Registry of Interpreters for the Deaf Publications Silver Spring, MD, 1996."},{"key":"6","doi-asserted-by":"publisher","unstructured":"[6] P. Keserwani, R. Saini, M. Liwicki, and P.P. Roy, \u201cRobust scene text detection for partially annotated training data,\u201d IEEE Transactions on Circuits and Systems for Video Technology, vol.32, no.12, pp.8635-8645, 2022. 10.1109\/tcsvt.2022.3194835","DOI":"10.1109\/TCSVT.2022.3194835"},{"key":"7","doi-asserted-by":"publisher","unstructured":"[7] Z. Zou, K. Chen, Z. Shi, Y. Guo, and J. Ye, \u201cObject detection in 20 years: A survey,\u201d Proceedings of the IEEE, vol.111, no.3, pp.257-276, 2023. 10.1109\/jproc.2023.3238524","DOI":"10.1109\/JPROC.2023.3238524"},{"key":"8","doi-asserted-by":"crossref","unstructured":"[8] D. Li, C. Rodriguez, X. Yu, and H. 
Li, \u201cWord-level deep sign language recognition from video: A new large-scale dataset and methods comparison,\u201d Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, pp.1459-1469, 2020. 10.1109\/wacv45572.2020.9093512","DOI":"10.1109\/WACV45572.2020.9093512"},{"key":"9","doi-asserted-by":"publisher","unstructured":"[9] R. Cui, H. Liu, and C. Zhang, \u201cA deep neural framework for continuous sign language recognition by iterative training,\u201d IEEE Transactions on Multimedia, vol.21, no.7, pp.1880-1891, 2019. 10.1109\/tmm.2018.2889563","DOI":"10.1109\/TMM.2018.2889563"},{"key":"10","doi-asserted-by":"crossref","unstructured":"[10] M. Boh\u00e1\u010dek and M. Hr\u00faz, \u201cSign pose-based transformer for word-level sign language recognition,\u201d Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, pp.182-191, 2022. 10.1109\/wacvw54805.2022.00024","DOI":"10.1109\/WACVW54805.2022.00024"},{"key":"11","doi-asserted-by":"crossref","unstructured":"[11] A. Yin, T. Zhong, L. Tang, W. Jin, T. Jin, and Z. Zhao, \u201cGloss attention for gloss-free sign language translation,\u201d Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp.2551-2562, 2023. 10.1109\/cvpr52729.2023.00251","DOI":"10.1109\/CVPR52729.2023.00251"},{"key":"12","doi-asserted-by":"publisher","unstructured":"[12] N. Habili, C.C. Lim, and A. Moini, \u201cSegmentation of the face and hands in sign language video sequences using color and motion cues,\u201d IEEE Transactions on Circuits and Systems for Video Technology, vol.14, no.8, pp.1086-1097, 2004. 10.1109\/tcsvt.2004.831970","DOI":"10.1109\/TCSVT.2004.831970"},{"key":"13","doi-asserted-by":"publisher","unstructured":"[13] R. Rastgoo, K. Kiani, and S. Escalera, \u201cSign language recognition: A deep survey,\u201d Expert Systems with Applications, vol.164, p.113794, 2021. 
10.1016\/j.eswa.2020.113794","DOI":"10.1016\/j.eswa.2020.113794"},{"key":"14","doi-asserted-by":"crossref","unstructured":"[14] J. Carreira and A. Zisserman, \u201cQuo vadis, action recognition? a new model and the kinetics dataset,\u201d Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.6299-6308, 2017. 10.1109\/cvpr.2017.502","DOI":"10.1109\/CVPR.2017.502"},{"key":"15","doi-asserted-by":"crossref","unstructured":"[15] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, \u201cGoing deeper with convolutions,\u201d Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.1-9, 2015. 10.1109\/cvpr.2015.7298594","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"16","doi-asserted-by":"crossref","unstructured":"[16] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, \u201cImage captioning with semantic attention,\u201d Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.4651-4659, 2016. 10.1109\/cvpr.2016.503","DOI":"10.1109\/CVPR.2016.503"},{"key":"17","doi-asserted-by":"publisher","unstructured":"[17] B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai, \u201cAster: An attentional scene text recognizer with flexible rectification,\u201d IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.41, no.9, pp.2035-2048, 2018. 10.1109\/tpami.2018.2848939","DOI":"10.1109\/TPAMI.2018.2848939"},{"key":"18","doi-asserted-by":"publisher","unstructured":"[18] A.K. Bhunia, A. Konwer, A.K. Bhunia, A. Bhowmick, P.P. Roy, and U. Pal, \u201cScript identification in natural scene image and video frames using an attention based convolutional-lstm network,\u201d Pattern Recognition, vol.85, pp.172-184, 2019. 10.1016\/j.patcog.2018.07.034","DOI":"10.1016\/j.patcog.2018.07.034"},{"key":"19","doi-asserted-by":"publisher","unstructured":"[19] T. Starner, J. Weaver, and A. 
Pentland, \u201cReal-time american sign language recognition using desk and wearable computer based video,\u201d IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.20, no.12, pp.1371-1375, 1998. 10.1109\/34.735811","DOI":"10.1109\/34.735811"},{"key":"20","doi-asserted-by":"crossref","unstructured":"[20] C. Vogler and D. Metaxas, \u201cParallel hidden markov models for american sign language recognition,\u201d Proceedings of the Seventh IEEE International Conference on Computer Vision, IEEE, 1999. 10.1109\/iccv.1999.791206","DOI":"10.1109\/ICCV.1999.791206"},{"key":"21","doi-asserted-by":"crossref","unstructured":"[21] J. Pu, W. Zhou, J. Zhang, and H. Li, \u201cSign language recognition based on trajectory modeling with hmms,\u201d International Conference on Multimedia Modeling, pp.686-697, Springer, 2016. 10.1007\/978-3-319-27671-7_58","DOI":"10.1007\/978-3-319-27671-7_58"},{"key":"22","doi-asserted-by":"publisher","unstructured":"[22] C. Oz and M.C. Leu, \u201cLinguistic properties based on american sign language isolated word recognition with artificial neural networks using a sensory glove and motion tracker,\u201d Neurocomputing, vol.70, no.16-18, pp.2891-2901, 2007. 10.1016\/j.neucom.2006.04.016","DOI":"10.1016\/j.neucom.2006.04.016"},{"key":"23","doi-asserted-by":"crossref","unstructured":"[23] Y. Lin, X. Chai, Y. Zhou, and X. Chen, \u201cCurve matching from the view of manifold for sign language recognition,\u201d Asian Conference on Computer Vision, pp.233-246, 2014. 10.1007\/978-3-319-16634-6_18","DOI":"10.1007\/978-3-319-16634-6_18"},{"key":"24","doi-asserted-by":"publisher","unstructured":"[24] L.-C. Wang, R. Wang, D.-H. Kong, and B.-C. Yin, \u201cSimilarity assessment model for chinese sign language videos,\u201d IEEE Transactions on Multimedia, vol.16, no.3, pp.751-761, 2014. 10.1109\/tmm.2014.2298382","DOI":"10.1109\/TMM.2014.2298382"},{"key":"25","doi-asserted-by":"publisher","unstructured":"[25] H. Hikawa and K. 
Kaida, \u201cNovel FPGA implementation of hand sign recognition system with SOM-hebb classifier,\u201d IEEE Transactions on Circuits and Systems for Video Technology, vol.25, no.1, pp.153-166, 2014. 10.1109\/tcsvt.2014.2335831","DOI":"10.1109\/TCSVT.2014.2335831"},{"key":"26","doi-asserted-by":"crossref","unstructured":"[26] P. Molchanov, S. Gupta, K. Kim, and J. Kautz, \u201cHand gesture recognition with 3d convolutional neural networks,\u201d Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp.1-7, 2015. 10.1109\/cvprw.2015.7301342","DOI":"10.1109\/CVPRW.2015.7301342"},{"key":"27","doi-asserted-by":"crossref","unstructured":"[27] P. Molchanov, X. Yang, S. Gupta, K. Kim, S. Tyree, and J. Kautz, \u201cOnline detection and classification of dynamic hand gestures with recurrent 3d convolutional neural network,\u201d Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.4207-4215, 2016. 10.1109\/cvpr.2016.456","DOI":"10.1109\/CVPR.2016.456"},{"key":"28","doi-asserted-by":"publisher","unstructured":"[28] L. Pigou, A. Van Den Oord, S. Dieleman, M. Van Herreweghe, and J. Dambre, \u201cBeyond temporal pooling: Recurrence and temporal convolutions for gesture recognition in video,\u201d International Journal of Computer Vision, vol.126, no.2, pp.430-439, 2018. 10.1007\/s11263-016-0957-7","DOI":"10.1007\/s11263-016-0957-7"},{"key":"29","doi-asserted-by":"crossref","unstructured":"[29] N. Neverova, C. Wolf, G.W. Taylor, and F. Nebout, \u201cMulti-scale deep learning for gesture detection and localization,\u201d European Conference on Computer Vision, pp.474-490, 2014. 10.1007\/978-3-319-16178-5_33","DOI":"10.1007\/978-3-319-16178-5_33"},{"key":"30","doi-asserted-by":"crossref","unstructured":"[30] N. Nishida and H. Nakayama, \u201cMultimodal gesture recognition using multi-stream recurrent neural network,\u201d Image and Video Technology, pp.682-694, 2015. 
10.1007\/978-3-319-29451-3_54","DOI":"10.1007\/978-3-319-29451-3_54"},{"key":"31","doi-asserted-by":"publisher","unstructured":"[31] D. Wu, L. Pigou, P.-J. Kindermans, N.D.-H. Le, L. Shao, J. Dambre, and J.-M. Odobez, \u201cDeep dynamic neural networks for multimodal gesture segmentation and recognition,\u201d IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.38, no.8, pp.1583-1597, 2016. 10.1109\/tpami.2016.2537340","DOI":"10.1109\/TPAMI.2016.2537340"},{"key":"32","doi-asserted-by":"publisher","unstructured":"[32] K. Fu, J. Jin, R. Cui, F. Sha, and C. Zhang, \u201cAligning where to see and what to tell: Image captioning with region-based attention and scene-specific contexts,\u201d IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.39, no.12, pp.2321-2334, 2016. 10.1109\/tpami.2016.2642953","DOI":"10.1109\/TPAMI.2016.2642953"},{"key":"33","doi-asserted-by":"crossref","unstructured":"[33] T.-Y. Lin, A. RoyChowdhury, and S. Maji, \u201cBilinear cnn models for fine-grained visual recognition,\u201d Proceedings of the IEEE International Conference on Computer Vision, pp.1449-1457, 2015. 10.1109\/iccv.2015.170","DOI":"10.1109\/ICCV.2015.170"},{"key":"34","unstructured":"[34] J. Ba, V. Mnih, and K. Kavukcuoglu, \u201cMultiple object recognition with visual attention,\u201d arXiv preprint arXiv:1412.7755, 2014."},{"key":"35","unstructured":"[35] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, \u201cRecurrent models of visual attention,\u201d Advances in Neural Information Processing Systems, pp.2204-2212, 2014."},{"key":"36","doi-asserted-by":"publisher","unstructured":"[36] J. Huang, W. Zhou, Q. Zhang, H. Li, and W. Li, \u201cVideo-based sign language recognition without temporal segmentation,\u201d Thirty-Second AAAI Conference on Artificial Intelligence, vol.32, no.1, 2018. 10.1609\/aaai.v32i1.11903","DOI":"10.1609\/aaai.v32i1.11903"},{"key":"37","doi-asserted-by":"publisher","unstructured":"[37] J. Huang, W. Zhou, H. Li, and W. 
Li, \u201cAttention-based 3d-cnns for large-vocabulary sign language recognition,\u201d IEEE Transactions on Circuits and Systems for Video Technology, vol.29, no.9, pp.2822-2832, 2018. 10.1109\/tcsvt.2018.2870740","DOI":"10.1109\/TCSVT.2018.2870740"},{"key":"38","doi-asserted-by":"publisher","unstructured":"[38] Y. Li, X. Wang, W. Liu, and B. Feng, \u201cDeep attention network for joint hand gesture localization and recognition using static rgb-d images,\u201d Information Sciences, vol.441, pp.66-78, 2018. 10.1016\/j.ins.2018.02.024","DOI":"10.1016\/j.ins.2018.02.024"},{"key":"39","doi-asserted-by":"crossref","unstructured":"[39] B. Shi, A.M. Del Rio, J. Keane, J. Michaux, D. Brentari, G. Shakhnarovich, and K. Livescu, \u201cAmerican sign language fingerspelling recognition in the wild,\u201d 2018 IEEE Spoken Language Technology Workshop, pp.145-152, 2018. 10.1109\/slt.2018.8639639","DOI":"10.1109\/SLT.2018.8639639"},{"key":"40","doi-asserted-by":"publisher","unstructured":"[40] S. Alyami, H. Luqman, and M. Hammoudeh, \u201cIsolated arabic sign language recognition using a transformer-based model and landmark keypoints,\u201d ACM Transactions on Asian and Low-Resource Language Information Processing, vol.23, no.1, pp.1-19, 2024. 10.1145\/3584984","DOI":"10.1145\/3584984"},{"key":"41","doi-asserted-by":"publisher","unstructured":"[41] N. Naz, H. Sajid, S. Ali, O. Hasan, and M.K. Ehsan, \u201cSigngraph: An efficient and accurate pose-based graph convolution approach toward sign language recognition,\u201d IEEE Access, vol.11, pp.19135-19147, 2023. 10.1109\/access.2023.3247761","DOI":"10.1109\/ACCESS.2023.3247761"},{"key":"42","unstructured":"[42] K.M. Dafnis, \u201cBidirectional skeleton-based isolated sign recognition using graph convolution networks,\u201d LREC Proceedings, 2022."},{"key":"43","doi-asserted-by":"publisher","unstructured":"[43] D.R. Kothadiya, C.M. Bhatt, T. Saba, A. Rehman, and S.A. 
Bahaj, \u201cSignformer: deepvision transformer for sign language recognition,\u201d IEEE Access, vol.11, pp.4730-4739, 2023. 10.1109\/access.2022.3231130","DOI":"10.1109\/ACCESS.2022.3231130"},{"key":"44","unstructured":"[44] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., \u201cAn image is worth 16x16 words: Transformers for image recognition at scale,\u201d arXiv preprint arXiv:2010.11929, 2020."},{"key":"45","doi-asserted-by":"crossref","unstructured":"[45] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A.C. Berg, and L. Fei-Fei, \u201cImagenet large scale visual recognition challenge,\u201d International Journal of Computer Vision, vol.115, no.3, pp.211-252, 2015. 10.1007\/s11263-015-0816-y","DOI":"10.1007\/s11263-015-0816-y"},{"key":"46","doi-asserted-by":"crossref","unstructured":"[46] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. Fei-Fei, \u201cImagenet: A large-scale hierarchical image database,\u201d IEEE Conference on Computer Vision and Pattern Recognition, pp.248-255, IEEE, 2009. 10.1109\/cvpr.2009.5206848","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"47","doi-asserted-by":"publisher","unstructured":"[47] S. Minaee, Y.Y. Boykov, F. Porikli, A.J. Plaza, N. Kehtarnavaz, and D. Terzopoulos, \u201cImage segmentation using deep learning: A survey,\u201d IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. 10.1109\/tpami.2021.3059968","DOI":"10.1109\/TPAMI.2021.3059968"},{"key":"48","unstructured":"[48] K. Simonyan and A. Zisserman, \u201cVery deep convolutional networks for large-scale image recognition,\u201d arXiv preprint arXiv:1409.1556, 2014."},{"key":"49","doi-asserted-by":"crossref","unstructured":"[49] K. He, X. Zhang, S. Ren, and J. 
Sun, \u201cDeep residual learning for image recognition,\u201d Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.770-778, 2016. 10.1109\/cvpr.2016.90","DOI":"10.1109\/CVPR.2016.90"},{"key":"50","doi-asserted-by":"crossref","unstructured":"[50] G. Huang, Z. Liu, L. Van Der Maaten, and K.Q. Weinberger, \u201cDensely connected convolutional networks,\u201d Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.4700-4708, 2017. 10.1109\/cvpr.2017.243","DOI":"10.1109\/CVPR.2017.243"},{"key":"51","unstructured":"[51] H.R. Vaezi Joze and O. Koller, \u201cMS-ASL: A large-scale data set and benchmark for understanding american sign language,\u201d Proceedings of the British Machine Vision Conference, 2019."},{"key":"52","doi-asserted-by":"crossref","unstructured":"[52] A. Tunga, S.V. Nuthalapati, and J.P. Wachs, \u201cPose-based sign language recognition using gcn and bert,\u201d Proceedings of the Winter Conference on Applications of Computer Vision (Workshops), pp.31-40, 2021. 10.1109\/wacvw52041.2021.00008","DOI":"10.1109\/WACVW52041.2021.00008"},{"key":"53","doi-asserted-by":"crossref","unstructured":"[53] D. Li, X. Yu, C. Xu, L. Petersson, and H. Li, \u201cTransferring cross-domain knowledge for video sign language recognition,\u201d Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.6205-6214, 2020. 10.1109\/cvpr42600.2020.00624","DOI":"10.1109\/CVPR42600.2020.00624"},{"key":"54","unstructured":"[54] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., \u201cPyTorch: An imperative style, high-performance deep learning library,\u201d Advances in Neural Information Processing Systems, vol.32, pp.8026-8037, 2019."},{"key":"55","unstructured":"[55] L. Van der Maaten and G. 
Hinton, \u201cVisualizing data using t-SNE,\u201d Journal of Machine Learning Research, vol.9, no.11, 2008."}],"container-title":["IEICE Transactions on Information and Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.jstage.jst.go.jp\/article\/transinf\/E108.D\/11\/E108.D_2024EDP7297\/_pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,1]],"date-time":"2025-11-01T03:36:11Z","timestamp":1761968171000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.jstage.jst.go.jp\/article\/transinf\/E108.D\/11\/E108.D_2024EDP7297\/_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,1]]},"references-count":55,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2025]]}},"URL":"https:\/\/doi.org\/10.1587\/transinf.2024edp7297","relation":{},"ISSN":["0916-8532","1745-1361"],"issn-type":[{"value":"0916-8532","type":"print"},{"value":"1745-1361","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,11,1]]},"article-number":"2024EDP7297"}}