{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T01:37:00Z","timestamp":1760233020212,"version":"build-2065373602"},"reference-count":34,"publisher":"MDPI AG","issue":"24","license":[{"start":{"date-parts":[[2022,12,8]],"date-time":"2022-12-08T00:00:00Z","timestamp":1670457600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100012166","name":"National Key Research and Development Program of China","doi-asserted-by":"publisher","award":["2020YFC1523302","HBU2022ss014","521100221081","62172392","QN2022107"],"award-info":[{"award-number":["2020YFC1523302","HBU2022ss014","521100221081","62172392","QN2022107"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Hebei University","award":["2020YFC1523302","HBU2022ss014","521100221081","62172392","QN2022107"],"award-info":[{"award-number":["2020YFC1523302","HBU2022ss014","521100221081","62172392","QN2022107"]}]},{"name":"National Natural Science Foundation of China","award":["2020YFC1523302","HBU2022ss014","521100221081","62172392","QN2022107"],"award-info":[{"award-number":["2020YFC1523302","HBU2022ss014","521100221081","62172392","QN2022107"]}]},{"name":"Scientific Research Foundation of Colleges and Universities in Hebei Province","award":["2020YFC1523302","HBU2022ss014","521100221081","62172392","QN2022107"],"award-info":[{"award-number":["2020YFC1523302","HBU2022ss014","521100221081","62172392","QN2022107"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>As a typical sequence to sequence task, sign language production (SLP) aims to automatically translate spoken language sentences into the corresponding sign language sequences. 
Existing SLP methods fall into two categories: autoregressive and non-autoregressive. Autoregressive methods suffer from high latency and error accumulation caused by the long-term dependence of the current output on previous poses, while non-autoregressive methods suffer from repetition and omission during parallel decoding. To remedy these issues, we propose a novel method named Pyramid Semi-Autoregressive Transformer with Rich Semantics (PSAT-RS). In PSAT-RS, we first introduce a pyramid semi-autoregressive mechanism that divides the target sequence into groups in a coarse-to-fine manner, preserving the autoregressive property globally while generating target frames in parallel locally. Meanwhile, a relaxed masked attention mechanism is adopted so that the decoder not only captures the pose sequences in previous groups but also attends to the current group. Finally, considering the importance of spatio-temporal information, we also design a Rich Semantics embedding (RS) module that encodes both temporal order and spatial displacement into the same high-dimensional space. This significantly improves the coordination of joint motion, making the generated sign language videos more natural. 
Experimental results on the RWTH-PHOENIX-Weather-2014T and CSL datasets show that the proposed PSAT-RS is competitive with state-of-the-art autoregressive and non-autoregressive SLP models, achieving a better trade-off between speed and accuracy.<\/jats:p>","DOI":"10.3390\/s22249606","type":"journal-article","created":{"date-parts":[[2022,12,8]],"date-time":"2022-12-08T03:35:53Z","timestamp":1670470553000},"page":"9606","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["A Pyramid Semi-Autoregressive Transformer with Rich Semantics for Sign Language Production"],"prefix":"10.3390","volume":"22","author":[{"given":"Zhenchao","family":"Cui","sequence":"first","affiliation":[{"name":"Hebei Machine Vision Engineering Research Center, School of Cyber Security and Computer, Hebei University, Baoding 071002, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9054-4700","authenticated-orcid":false,"given":"Ziang","family":"Chen","sequence":"additional","affiliation":[{"name":"Hebei Machine Vision Engineering Research Center, School of Cyber Security and Computer, Hebei University, Baoding 071002, China"}]},{"given":"Zhaoxin","family":"Li","sequence":"additional","affiliation":[{"name":"Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China"}]},{"given":"Zhaoqi","family":"Wang","sequence":"additional","affiliation":[{"name":"Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China"}]}],"member":"1968","published-online":{"date-parts":[[2022,12,8]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"41","DOI":"10.1016\/j.neunet.2020.01.030","article-title":"Skeleton-based Chinese sign language recognition and generation for bidirectional communication between deaf and hearing people","volume":"125","author":"Xiao","year":"2020","journal-title":"Neural 
Netw."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Saunders, B., Camgoz, N.C., and Bowden, R. (2020, January 23\u201328). Progressive Transformers for End-to-End Sign Language Production. Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK.","DOI":"10.1007\/978-3-030-58621-8_40"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"2113","DOI":"10.1007\/s11263-021-01457-9","article-title":"Continuous 3D multi-channel sign language production via progressive transformers and mixture density networks","volume":"129","author":"Saunders","year":"2021","journal-title":"Int. J. Comput. Vis."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Saunders, B., Camgoz, N.C., and Bowden, R. (2021, January 10\u201317). Mixed SIGNals: Sign Language Production via a Mixture of Motion Primitives. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00193"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Tang, S., Hong, R., Guo, D., and Wang, M. (2022, January 10\u201314). Gloss Semantic-Enhanced Network with Online Back-Translation for Sign Language Production. Proceedings of the ACM International Conference on Multimedia (ACM MM), Lisbon, Portugal.","DOI":"10.1145\/3503161.3547830"},{"key":"ref_6","unstructured":"Hwang, E., Kim, J.H., and Park, J.C. (2021, January 22\u201325). Non-Autoregressive Sign Language Production with Gaussian Space. Proceedings of the 32nd British Machine Vision Conference (BMVC 21), British Machine Vision Conference (BMVC), Virtual Event."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Huang, W., Pan, W., Zhao, Z., and Tian, Q. (2021, January 20\u201324). Towards Fast and High-Quality Sign Language Production. 
Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event.","DOI":"10.1145\/3474085.3475463"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Wang, C., Zhang, J., and Chen, H. (2018). Semi-autoregressive neural machine translation. arXiv.","DOI":"10.18653\/v1\/D18-1044"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Zhang, P., Lan, C., Zeng, W., Xing, J., and Zheng, N. (2020, January 13\u201319). Semantics-Guided Neural Networks for Efficient Skeleton-Based Human Action Recognition. Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00119"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Cui, R., Hu, L., and Zhang, C. (2017, January 21\u201326). Recurrent Convolutional Neural Networks for Continuous Sign Language Recognition by Staged Optimization. Proceedings of the IEEE Conference on Computer Vision & Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.175"},{"key":"ref_11","unstructured":"Shi, F., Lee, C., Qiu, L., Zhao, Y., Shen, T., Muralidhar, S., Han, T., Zhu, S.C., and Narayanan, V. (2021). STAR: Sparse Transformer-based Action Recognition. arXiv."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Ghosh, P., Song, J., Aksan, E., and Hilliges, O. (2017, January 10\u201312). Learning human motion models for long-term predictions. Proceedings of the 2017 International Conference on 3D Vision (3DV), Qingdao, China.","DOI":"10.1109\/3DV.2017.00059"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Cho, S., Maqbool, M.H., Liu, F., and Foroosh, H. (2019, January 4\u20138). Self-Attention Network for Skeleton-based Human Action Recognition. 
Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA.","DOI":"10.1109\/WACV45572.2020.9093639"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Camgoz, N.C., Hadfield, S., Koller, O., Ney, H., and Bowden, R. (2018, January 18\u201322). Neural Sign Language Translation. Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00812"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"2662","DOI":"10.1109\/TMM.2021.3087006","article-title":"Conditional Sentence Generation and Cross-Modal Reranking for Sign Language Translation","volume":"24","author":"Zhao","year":"2022","journal-title":"IEEE Trans. Multimed."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Pu, J., Zhou, W., and Li, H. (2018, January 13\u201319). Dilated convolutional network with iterative optimization for continuous sign language recognition. Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18), Stockholm, Sweden.","DOI":"10.24963\/ijcai.2018\/123"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"4433","DOI":"10.1109\/TMM.2021.3117124","article-title":"Graph-Based Multimodal Sequential Embedding for Sign Language Translation","volume":"24","author":"Tang","year":"2022","journal-title":"IEEE Trans. Multimed."},{"key":"ref_18","unstructured":"Camgoz, N.C., Koller, O., Hadfield, S., and Bowden, R. (2020, January 14\u201319). Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA."},{"key":"ref_19","unstructured":"Saunders, B., Camgoz, N.C., and Bowden, R. (2020). Adversarial training for multi-channel sign language production. arXiv."},{"key":"ref_20","unstructured":"Ventura, L., Duarte, A., and Gir\u00f3-i Nieto, X. (2020). Can everybody sign now? 
Exploring sign language video generation from 2D poses. arXiv."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Saunders, B., Camg\u00f6z, N.C., and Bowden, R. (2022, January 19\u201324). Signing at Scale: Learning to Co-Articulate Signs for Large-Scale Photo-Realistic Sign Language Production. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00508"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"891","DOI":"10.1007\/s11263-019-01281-2","article-title":"Text2Sign: Towards Sign Language Production Using Neural Machine Translation and Generative Adversarial Networks","volume":"128","author":"Stoll","year":"2020","journal-title":"Int. J. Comput. Vis."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"1395","DOI":"10.35940\/ijeat.D7637.049420","article-title":"Neural machine translation using recurrent neural network","volume":"9","author":"Datta","year":"2020","journal-title":"Int. J. Eng. Adv. Technol."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Chen, M.X., Firat, O., Bapna, A., Johnson, M., Macherey, W., Foster, G., Jones, L., Parmar, N., Schuster, M., and Chen, Z. (2018). The best of both worlds: Combining recent advances in neural machine translation. arXiv.","DOI":"10.18653\/v1\/P18-1008"},{"key":"ref_25","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017, January 4\u20139). Attention is all you need. Proceedings of the Advances in Neural Information Processing Systems 30, Long Beach, CA, USA."},{"key":"ref_26","unstructured":"Wang, Y., Tian, F., Di, H., Tao, Q., and Liu, T.Y. (February, January 27). Non-Autoregressive Machine Translation with Auxiliary Regularization. 
Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), Honolulu, HI, USA."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Lee, J., Mansimov, E., and Cho, K. (2018). Deterministic non-autoregressive neural sequence modeling by iterative refinement. arXiv.","DOI":"10.18653\/v1\/D18-1149"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Zhou, Y., Zhang, Y., Hu, Z., and Wang, M. (2021, January 11\u201317). Semi-Autoregressive Transformer for Image Captioning. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Virtual Event.","DOI":"10.1109\/ICCVW54120.2021.00350"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Wang, M., Jiaxin, G., Wang, Y., Chen, Y., Chang, S., Shang, H., Zhang, M., Tao, S., and Yang, H. (2021, January 11). How Length Prediction Influence the Performance of Non-Autoregressive Translation?. Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, Punta Cana, Dominican Republic.","DOI":"10.18653\/v1\/2021.blackboxnlp-1.14"},{"key":"ref_30","unstructured":"Forster, J., Schmidt, C., Koller, O., Bellgardt, M., and Ney, H. (2014, January 26\u201331). Extensions of the sign language recognition and translation corpus RWTH-PHOENIX-Weather. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC\u201914), Reykjavik, Iceland."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Pu, J., Zhou, W., and Li, H. (2019, January 16\u201320). Iterative Alignment Network for Continuous Sign Language Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00429"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Chen, W., Jiang, Z., Guo, H., and Ni, X. (2020). Fall detection based on key points of human-skeleton using openpose. 
Symmetry, 12.","DOI":"10.3390\/sym12050744"},{"key":"ref_33","unstructured":"Kingma, D., and Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv."},{"key":"ref_34","unstructured":"Gotmare, A., Keskar, N.S., Xiong, C., and Socher, R. (2018). A closer look at deep learning heuristics: Learning rate restarts, warmup and distillation. arXiv."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/24\/9606\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T01:36:09Z","timestamp":1760146569000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/24\/9606"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,12,8]]},"references-count":34,"journal-issue":{"issue":"24","published-online":{"date-parts":[[2022,12]]}},"alternative-id":["s22249606"],"URL":"https:\/\/doi.org\/10.3390\/s22249606","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2022,12,8]]}}}