{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,2]],"date-time":"2025-11-02T11:08:05Z","timestamp":1762081685625,"version":"build-2065373602"},"reference-count":34,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2022,12,26]],"date-time":"2022-12-26T00:00:00Z","timestamp":1672012800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100012166","name":"National Key Research and Development Program of China","doi-asserted-by":"publisher","award":["2022YFC2405600","81871444","62071241","62075098","62001240","BK20192004D","BE2022160","KYCX21_1557"],"award-info":[{"award-number":["2022YFC2405600","81871444","62071241","62075098","62001240","BK20192004D","BE2022160","KYCX21_1557"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"NSFC","doi-asserted-by":"publisher","award":["2022YFC2405600","81871444","62071241","62075098","62001240","BK20192004D","BE2022160","KYCX21_1557"],"award-info":[{"award-number":["2022YFC2405600","81871444","62071241","62075098","62001240","BK20192004D","BE2022160","KYCX21_1557"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Leading-edge Technology and Basic Research Program of Jiangsu","award":["2022YFC2405600","81871444","62071241","62075098","62001240","BK20192004D","BE2022160","KYCX21_1557"],"award-info":[{"award-number":["2022YFC2405600","81871444","62071241","62075098","62001240","BK20192004D","BE2022160","KYCX21_1557"]}]},{"name":"Key Research and Development Program of Jiangsu","award":["2022YFC2405600","81871444","62071241","62075098","62001240","BK20192004D","BE2022160","KYCX21_1557"],"award-info":[{"award-number":["2022YFC2405600","81871444","62071241","62075098","62001240","BK20192004D","BE2022160","KYCX21_1557"]}]},{"name":"Postgraduate Research and Practice Innovation Program of Jiangsu Province","award":["2022YFC2405600","81871444","62071241","62075098","62001240","BK20192004D","BE2022160","KYCX21_1557"],"award-info":[{"award-number":["2022YFC2405600","81871444","62071241","62075098","62001240","BK20192004D","BE2022160","KYCX21_1557"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Entropy"],"abstract":"<jats:p>Text-to-speech (TTS) synthesizers have been widely used as a vital assistive tool in various fields. Traditional sequence-to-sequence (seq2seq) TTS such as Tacotron2 uses a single soft attention mechanism for encoder and decoder alignment tasks, which is the biggest shortcoming that incorrectly or repeatedly generates words when dealing with long sentences. It may also generate sentences with run-on and wrong breaks regardless of punctuation marks, which causes the synthesized waveform to lack emotion and sound unnatural. In this paper, we propose an end-to-end neural generative TTS model that is based on the deep-inherited attention (DIA) mechanism along with an adjustable local-sensitive factor (LSF). The inheritance mechanism allows multiple iterations of the DIA by sharing the same training parameter, which tightens the token\u2013frame correlation, as well as fastens the alignment process. In addition, LSF is adopted to enhance the context connection by expanding the DIA concentration region. In addition, a multi-RNN block is used in the decoder for better acoustic feature extraction and generation. Hidden-state information driven from the multi-RNN layers is utilized for attention alignment. The collaborative work of the DIA and multi-RNN layers contributes to outperformance in the high-quality prediction of the phrase breaks of the synthesized speech. We used WaveGlow as a vocoder for real-time, human-like audio synthesis. Human subjective experiments show that the DIA-TTS achieved a mean opinion score (MOS) of 4.48 in terms of naturalness. Ablation studies further prove the superiority of the DIA mechanism for the enhancement of phrase breaks and attention robustness.<\/jats:p>","DOI":"10.3390\/e25010041","type":"journal-article","created":{"date-parts":[[2022,12,27]],"date-time":"2022-12-27T04:45:54Z","timestamp":1672116354000},"page":"41","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["DIA-TTS: Deep-Inherited Attention-Based Text-to-Speech Synthesizer"],"prefix":"10.3390","volume":"25","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4601-2942","authenticated-orcid":false,"given":"Junxiao","family":"Yu","sequence":"first","affiliation":[{"name":"Jiangsu Province Engineering Research Center of Smart Wearable and Rehabilitation Devices, School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing 211166, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9052-6070","authenticated-orcid":false,"given":"Zhengyuan","family":"Xu","sequence":"additional","affiliation":[{"name":"Jiangsu Province Engineering Research Center of Smart Wearable and Rehabilitation Devices, School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing 211166, China"},{"name":"Department of Medical Engineering, Wannan Medical College, Wuhu 241002, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6983-9379","authenticated-orcid":false,"given":"Xu","family":"He","sequence":"additional","affiliation":[{"name":"Jiangsu Province Engineering Research Center of Smart Wearable and Rehabilitation Devices, School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing 211166, China"}]},{"given":"Jian","family":"Wang","sequence":"additional","affiliation":[{"name":"Jiangsu Province Engineering Research Center of Smart Wearable and Rehabilitation Devices, School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing 211166, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3477-3178","authenticated-orcid":false,"given":"Bin","family":"Liu","sequence":"additional","affiliation":[{"name":"Jiangsu Province Engineering Research Center of Smart Wearable and Rehabilitation Devices, School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing 211166, China"}]},{"given":"Rui","family":"Feng","sequence":"additional","affiliation":[{"name":"Jiangsu Province Engineering Research Center of Smart Wearable and Rehabilitation Devices, School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing 211166, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4645-9322","authenticated-orcid":false,"given":"Songsheng","family":"Zhu","sequence":"additional","affiliation":[{"name":"Jiangsu Province Engineering Research Center of Smart Wearable and Rehabilitation Devices, School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing 211166, China"}]},{"given":"Wei","family":"Wang","sequence":"additional","affiliation":[{"name":"Jiangsu Province Engineering Research Center of Smart Wearable and Rehabilitation Devices, School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing 211166, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3524-8933","authenticated-orcid":false,"given":"Jianqing","family":"Li","sequence":"additional","affiliation":[{"name":"Jiangsu Province Engineering Research Center of Smart Wearable and Rehabilitation Devices, School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing 211166, China"}]}],"member":"1968","published-online":{"date-parts":[[2022,12,26]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"52926","DOI":"10.1109\/ACCESS.2021.3069205","article-title":"Towards Assisting the Visually Impaired: A Review on Techniques for Decoding the Visual Data from Chart Images","volume":"9","author":"Shahira","year":"2021","journal-title":"IEEE Access"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Jiang, H., Gonnot, T., Yi, W.-J., and Saniie, J. (2017, January 14\u201317). Computer Vision and Text Recognition for Assisting Visually Impaired People Using Android Smartphone. Proceedings of the 2017 IEEE International Conference on Electro Information Technology (EIT), Lincoln, NE, USA.","DOI":"10.1109\/EIT.2017.8053384"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"2061","DOI":"10.1007\/s00034-021-01875-7","article-title":"Incorporation of Happiness in Neutral Speech by Modifying Time-Domain Parameters of Emotive-Keywords","volume":"41","author":"Gladston","year":"2022","journal-title":"Circuits Syst. Signal Process"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"28","DOI":"10.1016\/j.chb.2019.05.009","article-title":"Hey Alexa\u2026 Examine the Variables Influencing the Use of Artificial Intelligent in-Home Voice Assistants","volume":"99","author":"McLean","year":"2019","journal-title":"Comput. Hum. Behav."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Subhash, S., Srivatsa, P.N., Siddesh, S., Ullas, A., and Santhosh, B. (2020, January 27\u201328). Artificial Intelligence-Based Voice Assistant. Proceedings of the 2020 Fourth World Conference on Smart Trends in Systems, Security and Sustainability (WorldS4), London, UK.","DOI":"10.1109\/WorldS450073.2020.9210344"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"1404","DOI":"10.1109\/TASSP.1985.1164727","article-title":"Mixture Autoregressive Hidden Markov Models for Speech Signals","volume":"33","author":"Juang","year":"1985","journal-title":"IEEE Trans. Acoust. Speech Signal Process."},{"key":"ref_7","unstructured":"Sotelo, J.M.R., Mehri, S., Kumar, K., Santos, J.F., Kastner, K., and Courville, A.C. (2017, January 24\u201326). Char2Wav: End-to-End Speech Synthesis. Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), Toulon, France."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Wang, Y., Skerry-Ryan, R.J., Stanton, D., Wu, Y., Weiss, R.J., and Jaitly, N. (2017). Tacotron: Towards End-to-End Speech Synthesis. arXiv.","DOI":"10.21437\/Interspeech.2017-1452"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Shen, J., Pang, R., Weiss, R.J., Schuster, M., Jaitly, N., and Yang, Z. (2018, January 15\u201320). Natural Tts Synthesis by Conditioning Wavenet on Mel Spectrogram Predictions. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8461368"},{"key":"ref_10","unstructured":"Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., and Zhao, Z. (2019, January 8\u201314). Fastspeech: Fast, Robust and Controllable Text to Speech. Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada."},{"key":"ref_11","unstructured":"Ren, Y., Hu, C., Tan, X., Qin, T., Zhao, S., and Zhao, Z. (2020). Fastspeech 2: Fast and High-Quality End-to-End Text to Speech. arXiv."},{"key":"ref_12","unstructured":"Ping, W., Peng, K., Gibiansky, A., Arik, S.O., Kannan, A., and Narang, S. (2017). Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning. arXiv."},{"key":"ref_13","unstructured":"Arik, S., Diamos, G., Gibiansky, A., Miller, J., Peng, K., and Ping, W. (2017). Deep Voice 2: Multi-Speaker Neural Text-to-Speech. arXiv."},{"key":"ref_14","unstructured":"Ar\u0131k, S.\u00d6., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., and Kang, Y. (2017, January 6\u201311). Deep Voice: Real-Time Neural Text-to-Speech. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia."},{"key":"ref_15","unstructured":"Oord, A.V.D., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., and Graves, A. (2016). Wavenet: A Generative Model for Raw Audio. arXiv."},{"key":"ref_16","unstructured":"van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., and Graves, A. (2016, January 5\u201310). Conditional Image Generation with Pixelcnn Decoders. Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain."},{"key":"ref_17","unstructured":"Paine, T.L., Khorrami, P., Chang, S., Zhang, Y., Ramachandran, P., and Hasegawa-Johnson, M.A. (2016). Fast Wavenet Generation Algorithm. arXiv."},{"key":"ref_18","unstructured":"Mehri, S., Kumar, K., Gulrajani, I., Kumar, R., Jain, S., and Sotelo, J. (2016). Samplernn: An Unconditional End-to-End Neural Audio Generation Model. arXiv."},{"key":"ref_19","unstructured":"Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv."},{"key":"ref_20","unstructured":"Chorowski, J.K., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. (2015, January 7\u201312). Attention-Based Models for Speech Recognition. Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Liu, R., Sisman, B., Li, J., Bao, F., Gao, G., and Li, H. (2020, January 4\u20138). Teacher-Student Training for Robust Tacotron-Based Tts. Proceedings of the ICASSP 2020\u20142020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9054681"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Liu, R., Bao, F., Gao, G., Zhang, H., and Wang, Y. (2018, January 2\u20136). Improving Mongolian Phrase Break Prediction by Using Syllable and Morphological Embeddings with Bilstm Model. Proceedings of the Interspeech, Hyderabad, India.","DOI":"10.21437\/Interspeech.2018-1706"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"He, M., Deng, Y., and He, L. (2019). Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural Tts. arXiv.","DOI":"10.21437\/Interspeech.2019-1972"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"65955","DOI":"10.1109\/ACCESS.2019.2914149","article-title":"Pre-Alignment Guided Attention for Improving Training Efficiency and Model Stability in End-to-End Speech Synthesis","volume":"7","author":"Zhu","year":"2019","journal-title":"IEEE Access"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Elias, I., Zen, H., Shen, J., Zhang, Y., Jia, Y., and Skerry-Ryan, R.J. (2021). Parallel Tacotron 2: A Non-Autoregressive Neural Tts Model with Differentiable Duration Modeling. arXiv.","DOI":"10.21437\/Interspeech.2021-1461"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Elias, I., Zen, H., Shen, J., Zhang, Y., Jia, Y., and Weiss, R.J. (2021, January 6\u201311). Parallel Tacotron: Non-Autoregressive and Controllable Tts. Proceedings of the ICASSP 2021\u20142021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada.","DOI":"10.1109\/ICASSP39728.2021.9414718"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Okamoto, T., Toda, T., Shiga, Y., and Kawai, H. (2019, January 14\u201318). Tacotron-Based Acoustic Model Using Phoneme Alignment for Practical Neural Text-to-Speech Systems. Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore.","DOI":"10.1109\/ASRU46091.2019.9003956"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Prenger, R., Valle, R., and Catanzaro, B. (2019, January 12\u201317). Waveglow: A Flow-Based Generative Network for Speech Synthesis. Proceedings of the ICASSP 2019\u20142019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK.","DOI":"10.1109\/ICASSP.2019.8683143"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Wang, Y., Stanton, D., Zhang, Y., Ryan, R., Battenberg, E., and Shor, J. (2018, January 10\u201315). Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden.","DOI":"10.1109\/SLT.2018.8639682"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 11\u201314). Identity Mappings in Deep Residual Networks. Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46493-0_38"},{"key":"ref_31","unstructured":"Kingma, D.P., and Dhariwal, P. (2018, January 3\u20138). Glow: Generative Flow with Invertible 1 \u00d7 1 Convolutions. Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, QC, Canada."},{"key":"ref_32","unstructured":"Kingma, D.P., and Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv."},{"key":"ref_33","unstructured":"Theis, L., Oord, A.V.D., and Bethge, M. (2015). A Note on the Evaluation of Generative Models. arXiv."},{"key":"ref_34","unstructured":"Li, N., Liu, S., Liu, Y., Zhao, S., and Liu, M. (February, January 27). Neural Speech Synthesis with Transformer Network. Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA."}],"container-title":["Entropy"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1099-4300\/25\/1\/41\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T01:51:34Z","timestamp":1760147494000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1099-4300\/25\/1\/41"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,12,26]]},"references-count":34,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2023,1]]}},"alternative-id":["e25010041"],"URL":"https:\/\/doi.org\/10.3390\/e25010041","relation":{},"ISSN":["1099-4300"],"issn-type":[{"type":"electronic","value":"1099-4300"}],"subject":[],"published":{"date-parts":[[2022,12,26]]}}}