{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,3,11]],"date-time":"2025-03-11T04:12:10Z","timestamp":1741666330058,"version":"3.38.0"},"reference-count":60,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,3,10]],"date-time":"2025-03-10T00:00:00Z","timestamp":1741564800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,3,10]],"date-time":"2025-03-10T00:00:00Z","timestamp":1741564800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100014013","name":"UK Research and Innovation","doi-asserted-by":"publisher","award":["EP\/S022694\/1"],"award-info":[{"award-number":["EP\/S022694\/1"]}],"id":[{"id":"10.13039\/100014013","id-type":"DOI","asserted-by":"publisher"}]},{"name":"RAEng\/Leverhulme Trust","award":["LTRF2223-19-106"],"award-info":[{"award-number":["LTRF2223-19-106"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J AUDIO SPEECH MUSIC PROC."],"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>This paper introduces singing to speech conversion (S2S), a cross-domain voice conversion task, and presents the first deep learning-based S2S system. S2S aims to transform singing into speech while retaining the phonetic information, reducing variations in pitch, rhythm, and timbre. Inspired by the Glow-TTS architecture, the proposed model is built using generative flow, with an adjusted alignment module between the latent features. We adapt the original monotonic alignment search (MAS) to the S2S scenario and utilize a duration predictor to deal with the duration differences between the two modalities. Subjective evaluations show that the proposed model outperforms signal processing baselines in naturalness and outperforms a transcribe-and-synthesize baseline in phonetic similarity to the original singing. We further demonstrate that singing-to-speech could be an effective augmentation method for low-resource lyrics transcription.<\/jats:p>","DOI":"10.1186\/s13636-025-00400-x","type":"journal-article","created":{"date-parts":[[2025,3,10]],"date-time":"2025-03-10T12:57:37Z","timestamp":1741611457000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Singing to speech conversion with generative flow"],"prefix":"10.1186","volume":"2025","author":[{"ORCID":"https:\/\/orcid.org\/0009-0009-0413-7868","authenticated-orcid":false,"given":"Jiawen","family":"Huang","sequence":"first","affiliation":[]},{"given":"Emmanouil","family":"Benetos","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,3,10]]},"reference":[{"key":"400_CR1","unstructured":"A.M. Kruspe, in Proceedings of the 17th International Society for Music Information Retrieval Conference, ISMIR 2016, New York City, United States, August 7-11, 2016. Bootstrapping a system for phoneme recognition and keyword spotting in unaccompanied singing (ISMIR, 2016), pp. 358\u2013364"},{"key":"400_CR2","unstructured":"A.M. Kruspe, in Proceedings of the 15th International Society for Music Information Retrieval Conference, ISMIR 2014, Taipei, Taiwan, October 27-31, 2014. Keyword spotting in a-capella singing (ISMR, 2014), pp. 
271\u2013276"},{"issue":"2","key":"400_CR3","doi-asserted-by":"publisher","first-page":"49","DOI":"10.1177\/108471380400800203","volume":"8","author":"HJ McDermott","year":"2004","unstructured":"H.J. McDermott, Music perception with cochlear implants: a review. Trends Amplification 8(2), 49\u201382 (2004)","journal-title":"Trends Amplification"},{"key":"400_CR4","doi-asserted-by":"crossref","unstructured":"E. Demirel, S. Ahlb\u00e4ck, S. Dixon, in 29th European Signal Processing Conference, EUSIPCO 2021, Dublin, Ireland, August 23-27, 2021. Computational pronunciation analysis in sung utterances (IEEE, 2021), pp. 186\u2013190","DOI":"10.23919\/EUSIPCO54536.2021.9616147"},{"key":"400_CR5","doi-asserted-by":"crossref","unstructured":"C. Gupta, H. Li, Y. Wang, in 19th Annual Conference of the International Speech Communication Association, Interspeech 2018, Hyderabad, India, September 2-6, 2018. Automatic pronunciation evaluation of singing (ISCA, 2018), pp. 1507\u20131511","DOI":"10.21437\/Interspeech.2018-1267"},{"key":"400_CR6","unstructured":"L. Ou, X. Gu, Y. Wang, in Proceedings of the 23rd International Society for Music Information Retrieval Conference, ISMIR 2022 Bengaluru, India December 4-8, 2022. Towards transfer learning of wav2vec 2.0 for automatic lyric transcription (ISMIR, 2022), pp. 891\u2013899"},{"key":"400_CR7","unstructured":"X. Tan, T. Qin, F. Soong, T.Y. Liu, A survey on neural speech synthesis. (2021).\u00a0arXiv\u00a0preprint\u00a0arXiv:2106.15561"},{"key":"400_CR8","unstructured":"L. Dinh, D. Krueger, Y. Bengio,\u00a0in\u00a03rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Workshop Track Proceedings. NICE: non-linear independent components estimation\u00a0(2015)"},{"key":"400_CR9","doi-asserted-by":"crossref","unstructured":"J.X. Zhang, L.J. Liu, Y.N. Chen, Y.J. Hu, Y. Jiang, Z.H. Ling, L.R. Dai, in Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge, Shanghai, China, October 30, 2020. Voice conversion by cascading automatic speech recognition and text-to-speech synthesis with prosody transfer (ISCA, 2020), pp. 121\u2013125","DOI":"10.21437\/VCC_BC.2020-16"},{"key":"400_CR10","doi-asserted-by":"crossref","unstructured":"S.W. Park, D. Kim, M. Joe, in 21st Annual Conference of the International Speech Communication Association, Interspeech 2020, Virtual Event, Shanghai, China, October 25-29, 2020. Cotatron: Transcription-guided speech encoder for any-to-many voice conversion without parallel data (ISCA, 2020), pp. 4696\u20134700","DOI":"10.21437\/Interspeech.2020-1542"},{"key":"400_CR11","doi-asserted-by":"crossref","unstructured":"M. Proszewska, G. Beringer, D. S\u00e1ez-Trigueros, T. Merritt, A. Ezzerg, R. Barra-Chicote, in 23rd Annual Conference of the International Speech Communication Association, Interspeech 2022, Incheon, Korea, September 18-22, 2022. Glowvc: Mel-spectrogram space disentangling model for language-independent text-free voice conversion (ISCA, 2022), pp. 2973\u20132977","DOI":"10.21437\/Interspeech.2022-322"},{"key":"400_CR12","doi-asserted-by":"crossref","unstructured":"Y. Zhou, X. Tian, H. Xu, R.K. Das, H. Li, in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2019, Brighton, United Kingdom, May 12-17, 2019. Cross-lingual voice conversion with bilingual phonetic posteriorgram and average modeling (IEEE, 2019), pp. 6790\u20136794","DOI":"10.1109\/ICASSP.2019.8683746"},{"key":"400_CR13","unstructured":"K. Qian, Y. Zhang, S. Chang, X. 
Yang, M. Hasegawa-Johnson, in Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA. Autovc: Zero-shot voice style transfer with only autoencoder loss, vol. 97 (PMLR, 2019), pp. 5210\u20135219"},{"key":"400_CR14","doi-asserted-by":"crossref","unstructured":"J. Chou, H. Lee, in 20th Annual Conference of the International Speech Communication Association, Interspeech 2019, Graz, Austria, September 15-19, 2019. One-shot voice conversion by separating speaker and content representations with instance normalization (ISCA, 2019), pp. 664\u2013668","DOI":"10.21437\/Interspeech.2019-2663"},{"key":"400_CR15","unstructured":"Y.A. Li, A. Zare, N. Mesgarani, in 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30 - September 3, 2021. Starganv2-vc: A diverse, unsupervised, non-parallel framework for natural-sounding voice conversion (ISCA, 2021), pp. 1349\u20131353"},{"key":"400_CR16","doi-asserted-by":"crossref","unstructured":"T. Kaneko, H. Kameoka, in 26th European Signal Processing Conference, EUSIPCO 2018, Roma, Italy, September 3-7, 2018. Cyclegan-vc: Non-parallel voice conversion using cycle-consistent adversarial networks (IEEE, 2018), pp. 2100\u20132104","DOI":"10.23919\/EUSIPCO.2018.8553236"},{"key":"400_CR17","unstructured":"N. Takahashi, M.K. Singh, Y. Mitsufuji, Robust one-shot singing voice conversion.\u00a0(2022).\u00a0arXiv\u00a0preprint\u00a0arXiv:2210.11096"},{"key":"400_CR18","doi-asserted-by":"crossref","unstructured":"S. Liu, Y. Cao, N. Hu, D. Su, H. Meng, in 2021 IEEE International Conference on Multimedia and Expo, ICME 2021, Shenzhen, China, July 5-9, 2021. Fastsvc: Fast cross-domain singing voice conversion with feature-wise linear modulation (IEEE, 2021), pp. 1\u20136","DOI":"10.1109\/ICME51207.2021.9428161"},{"key":"400_CR19","doi-asserted-by":"crossref","unstructured":"Y. Luo, C. Hsu, K. Agres, D. Herremans, in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4-8, 2020. Singing voice conversion with disentangled representations of singer and vocal technique using variational autoencoders (IEEE, 2020), pp. 3277\u20133281","DOI":"10.1109\/ICASSP40776.2020.9054582"},{"key":"400_CR20","doi-asserted-by":"crossref","unstructured":"N. Takahashi, M.K. Singh, Y. Mitsufuji, in International Joint Conference on Neural Networks, IJCNN 2021, Shenzhen, China, July 18-22, 2021. Hierarchical disentangled representation learning for singing voice conversion (IEEE, 2021), pp. 1\u20137","DOI":"10.1109\/IJCNN52387.2021.9533583"},{"key":"400_CR21","doi-asserted-by":"crossref","unstructured":"S. Liu, Y. Cao, D. Su, H. Meng,\u00a0in\u00a0IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021, Cartagena, Colombia, December 13-17, 2021. Diffsvc: A diffusion probabilistic model for singing voice conversion (IEEE, 2021), pp. 741\u2013748","DOI":"10.1109\/ASRU51503.2021.9688219"},{"key":"400_CR22","unstructured":"SVC\u00a0Develop\u00a0Team. so-vits-svc. (2024). https:\/\/github.com\/svc-develop-team\/so-vits-svc. Accessed 09 Jul 2024"},{"key":"400_CR23","doi-asserted-by":"crossref","unstructured":"S. Wager, G. Tzanetakis, C. Wang, M. Kim, in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4-8, 2020. Deep autotuner: A pitch correcting network for singing performances (IEEE, 2020), pp. 
246\u2013250","DOI":"10.1109\/ICASSP40776.2020.9054308"},{"key":"400_CR24","doi-asserted-by":"crossref","unstructured":"Y. Luo, M. Chen, T. Chi, L. Su, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018, Calgary, AB, Canada, April 15-20, 2018. Singing voice correction using canonical time warping (IEEE, 2018), pp. 156\u2013160","DOI":"10.1109\/ICASSP.2018.8461280"},{"key":"400_CR25","unstructured":"M. Morrison, Z. Jin, N.J. Bryan, J.P. Caceres, B. Pardo, Neural pitch-shifting and time-stretching with controllable lpcnet.\u00a0(2021).\u00a0arXiv\u00a0preprint\u00a0arXiv:2110.02360"},{"key":"400_CR26","unstructured":"B. O\u2019Connor, S. Dixon, G. Fazekas, in Proceedings of the 15th International Symposium on Computer Music Multidisciplinary Research, CMMR 2021, Tokyo, Japan, November 15-19, 2021. Zero-shot singing technique conversion (2021), pp. 235\u2013244"},{"key":"400_CR27","unstructured":"K. Qian, Y. Zhang, S. Chang, M. Hasegawa-Johnson, D.D. Cox, in Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event. Unsupervised speech decomposition via triple information bottleneck, vol. 119 (PMLR, 2020), pp. 7836\u20137846"},{"key":"400_CR28","doi-asserted-by":"crossref","unstructured":"S. Shechtman, A. Sorin, in 10th ISCA Workshop on Speech Synthesis, SSW 10, Vienna, Austria, September 20-22, 2019. Sequence to sequence neural speech synthesis with prosody modification capabilities (ISCA, 2019), pp. 275\u2013280","DOI":"10.21437\/SSW.2019-49"},{"key":"400_CR29","doi-asserted-by":"crossref","unstructured":"K. Vijayan, M. Dong, H. Li, in 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017, Kuala Lumpur, Malaysia, December 12-15, 2017. A dual alignment scheme for improved speech-to-singing voice conversion (IEEE, 2017), pp. 1547\u20131555","DOI":"10.1109\/APSIPA.2017.8282289"},{"key":"400_CR30","doi-asserted-by":"crossref","unstructured":"L. Cen, M. Dong, P.Y. Chan, in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2012, Kyoto, Japan, March 25-30, 2012. Template-based personalized singing voice synthesis (IEEE, 2012), pp. 4509\u20134512","DOI":"10.1109\/ICASSP.2012.6288920"},{"key":"400_CR31","doi-asserted-by":"crossref","unstructured":"B. Sharma, H. Li, in 20th Annual Conference of the International Speech Communication Association, Interspeech 2019, Graz, Austria, September 15-19, 2019. A combination of model-based and feature-based strategy for speech-to-singing alignment (ISCA, 2019), pp. 624\u2013628","DOI":"10.21437\/Interspeech.2019-1942"},{"key":"400_CR32","doi-asserted-by":"crossref","unstructured":"S. Agarwal, N. Takahashi, S. Ganapathy, in 23rd Annual Conference of the International Speech Communication Association, Interspeech 2022, Incheon, Korea, September 18-22, 2022. Leveraging symmetrical convolutional transformer networks for speech to singing voice style transfer (ISCA, 2022), pp. 3013\u20133017","DOI":"10.21437\/Interspeech.2022-11256"},{"key":"400_CR33","doi-asserted-by":"crossref","unstructured":"J. Parekh, P. Rao, Y. Yang, in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4-8, 2020. Speech-to-singing conversion in an encoder-decoder framework (IEEE, 2020), pp. 261\u2013265","DOI":"10.1109\/ICASSP40776.2020.9054473"},{"key":"400_CR34","unstructured":"D.P. Kingma, P. 
Dhariwal,\u00a0in\u00a0Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montr\u00e9al, Canada. Glow: Generative flow with invertible 1x1 convolutions (2018), pp. 10236\u201310245"},{"key":"400_CR35","unstructured":"E. Hoogeboom, R. van den Berg, M. Welling, in Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA. Emerging convolutions for generative normalizing flows, vol. 97 (PMLR, 2019), pp. 2771\u20132780"},{"key":"400_CR36","unstructured":"J. Kim, S. Kim, J. Kong, S. Yoon,\u00a0in\u00a0Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. Glow-tts: A generative flow for text-to-speech via monotonic alignment search (2020)"},{"key":"400_CR37","unstructured":"R. Valle, K.J. Shih, R. Prenger, B. Catanzaro,\u00a0in\u00a09th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis (2021)"},{"key":"400_CR38","doi-asserted-by":"crossref","unstructured":"Y. Zhang, J. Cong, H. Xue, L. Xie, P. Zhu, M. Bi, in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23-27 May 2022. Visinger: Variational inference with adversarial learning for end-to-end singing voice synthesis (IEEE, 2022), pp. 7237\u20137241","DOI":"10.1109\/ICASSP43922.2022.9747664"},{"key":"400_CR39","unstructured":"H. Choi, J. Yang, J. Lee, H. Kim,\u00a0in\u00a0The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. NANSY++: unified voice synthesis with neural analysis and synthesis (2023)"},{"key":"400_CR40","unstructured":"Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, T. Liu,\u00a0in\u00a0Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada. Fastspeech: Fast, robust and controllable text to speech (2019), pp. 3165\u20133174"},{"key":"400_CR41","doi-asserted-by":"crossref","unstructured":"C. Yu, H. Lu, N. Hu, M. Yu, C. Weng, K. Xu, P. Liu, D. Tuo, S. Kang, G. Lei, D. Su, D. Yu, in 21st Annual Conference of the International Speech Communication Association, Interspeech 2020, Virtual Event, Shanghai, China, October 25-29, 2020. Durian: Duration informed attention network for speech synthesis (ISCA, 2020), pp. 2027\u20132031","DOI":"10.21437\/Interspeech.2020-2968"},{"key":"400_CR42","doi-asserted-by":"crossref","unstructured":"A. Graves, S. Fern\u00e1ndez, F.J. Gomez, J. Schmidhuber, in Machine Learning, Proceedings of the Twenty-Third International Conference (ICML 2006), Pittsburgh, Pennsylvania, USA, June 25-29, 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, vol. 148 (ACM, 2006), pp. 369\u2013376","DOI":"10.1145\/1143844.1143891"},{"key":"400_CR43","doi-asserted-by":"crossref","unstructured":"Z. Duan, H. Fang, B. Li, K.C. Sim, Y. Wang, in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2013, Kaohsiung, Taiwan, October 29 - November 1, 2013. The NUS sung and spoken lyrics corpus: A quantitative comparison of singing and speech (IEEE, 2013), pp. 
1\u20139","DOI":"10.1109\/APSIPA.2013.6694316"},{"key":"400_CR44","doi-asserted-by":"crossref","unstructured":"E. Demirel, S. Ahlb\u00e4ck, S. Dixon,\u00a0in\u00a02020 International Joint Conference on Neural Networks, IJCNN 2020, Glasgow, United Kingdom, July 19-24, 2020. Automatic lyrics transcription using dilated convolutional neural networks with self-attention (IEEE, 2020), pp. 1\u20138","DOI":"10.1109\/IJCNN48605.2020.9207052"},{"key":"400_CR45","doi-asserted-by":"crossref","unstructured":"D.S. Park, W. Chan, Y. Zhang, C.C. Chiu, B. Zoph, E.D. Cubuk, Q.V. Le, in 20th Annual Conference of the International Speech Communication Association, Interspeech 2019, Graz, Austria, September 15-19, 2019. Specaugment: A simple data augmentation method for automatic speech recognition (2019), pp. 2613-2617","DOI":"10.21437\/Interspeech.2019-2680"},{"key":"400_CR46","unstructured":"I. Smule. Smule sing! 300x30x2 dataset. (2015). https:\/\/ccrma.stanford.edu\/damp\/. Accessed Dec 2024"},{"key":"400_CR47","doi-asserted-by":"crossref","unstructured":"G.R. Dabike, J. Barker, in 20th Annual Conference of the International Speech Communication Association, Interspeech 2019, Graz, Austria, September 15-19, 2019. Automatic lyric transcription from karaoke vocal tracks: Resources and a baseline system (ISCA, 2019), pp. 579\u2013583","DOI":"10.21437\/Interspeech.2019-2378"},{"key":"400_CR48","doi-asserted-by":"crossref","unstructured":"J. Shen, R. Pang, R.J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan, et al., in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018, Calgary, Alberta, Canada, April 15-20, 2018. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions (IEEE, 2018), pp. 4779\u20134783","DOI":"10.1109\/ICASSP.2018.8461368"},{"key":"400_CR49","doi-asserted-by":"publisher","first-page":"9","DOI":"10.1016\/j.specom.2021.07.002","volume":"133","author":"B Sharma","year":"2021","unstructured":"B. Sharma, X. Gao, K. Vijayan, X. Tian, H. Li, NHSS: A speech and singing parallel database. Speech Commun. 133, 9\u201322 (2021)","journal-title":"Speech Commun."},{"issue":"7","key":"400_CR50","doi-asserted-by":"publisher","first-page":"1877","DOI":"10.1587\/transinf.2015EDP7457","volume":"99","author":"M Morise","year":"2016","unstructured":"M. Morise, F. Yokomori, K. Ozawa, World: a vocoder-based high-quality speech synthesis system for real-time applications. IEICE Trans. Inf. Syst. 99(7), 1877\u20131884 (2016)","journal-title":"IEICE Trans. Inf. Syst."},{"key":"400_CR51","doi-asserted-by":"crossref","unstructured":"C. Lo, S. Fu, W. Huang, X. Wang, J. Yamagishi, Y. Tsao, H. Wang, in 20th Annual Conference of the International Speech Communication Association, Interspeech 2019, Graz, Austria, September 15-19, 2019. Mosnet: Deep learning-based objective assessment for voice conversion (ISCA, 2019), pp. 1541\u20131545","DOI":"10.21437\/Interspeech.2019-2003"},{"key":"400_CR52","doi-asserted-by":"crossref","unstructured":"A. Polyak, L. Wolf, Y. Adi, Y. Taigman, in 21st Annual Conference of the International Speech Communication Association, Interspeech 2020, Virtual Event, Shanghai, China, October 25-29, 2020. Unsupervised cross-domain singing voice conversion (ISCA, 2020), pp. 
801\u2013805","DOI":"10.21437\/Interspeech.2020-1862"},{"issue":"12","key":"400_CR53","doi-asserted-by":"publisher","first-page":"2197","DOI":"10.1109\/TASLP.2014.2363788","volume":"22","author":"JF Santos","year":"2014","unstructured":"J.F. Santos, T.H. Falk, Updating the srmr-ci metric for improved intelligibility prediction for cochlear implant users. IEEE\/ACM Trans. Audio Speech Lang. Process. (TASLP) 22(12), 2197\u20132206 (2014)","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process. (TASLP)"},{"key":"400_CR54","doi-asserted-by":"crossref","unstructured":"J. Laver, Principles of phonetics (Cambridge University Press, Cambridge, UK, 1994), p. 391","DOI":"10.1017\/CBO9781139166621"},{"key":"400_CR55","doi-asserted-by":"crossref","unstructured":"R. Huang, F. Chen, Y. Ren, J. Liu, C. Cui, Z. Zhao, in MM \u201921: ACM Multimedia Conference, Virtual Event, China, October 20 - 24, 2021, Multi-singer: Fast multi-singer singing voice vocoder with A large-scale corpus (ACM, 2021), pp. 3945\u20133954","DOI":"10.1145\/3474085.3475437"},{"key":"400_CR56","unstructured":"M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong, J.C. Chou, S.L. Yeh, S.W. Fu, C.F. Liao, E. Rastorgueva, F. Grondin, W. Aris, H. Na, Y. Gao, R.D. Mori, Y. Bengio. SpeechBrain: A general-purpose speech toolkit.\u00a0(2021).\u00a0ArXiv:2106.04624"},{"issue":"8","key":"400_CR57","doi-asserted-by":"publisher","first-page":"1240","DOI":"10.1109\/JSTSP.2017.2763455","volume":"11","author":"S Watanabe","year":"2017","unstructured":"S. Watanabe, T. Hori, S. Kim, J.R. Hershey, T. Hayashi, Hybrid ctc\/attention architecture for end-to-end speech recognition. IEEE J. Sel. Top. Signal Process. 11(8), 1240\u20131253 (2017). https:\/\/doi.org\/10.1109\/JSTSP.2017.2763455","journal-title":"IEEE J. Sel. Top. Signal Process."},{"key":"400_CR58","doi-asserted-by":"crossref","unstructured":"X. Gao, C. Gupta, H. Li,\u00a0in\u00a0IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23-27 May 2022. Genre-conditioned acoustic models for automatic lyrics transcription of polyphonic music\u00a0(IEEE, 2022), pp. 791\u2013795","DOI":"10.1109\/ICASSP43922.2022.9747684"},{"key":"400_CR59","unstructured":"D.P. Kingma, J. Ba,\u00a0in\u00a03rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. Adam: A method for stochastic optimization\u00a0(2015)"},{"key":"400_CR60","unstructured":"A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, I. Polosukhin,\u00a0in\u00a0Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA. Attention is all you need (2017), pp. 
5998\u20136008"}],"container-title":["EURASIP Journal on Audio, Speech, and Music Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13636-025-00400-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13636-025-00400-x\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13636-025-00400-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,10]],"date-time":"2025-03-10T12:58:02Z","timestamp":1741611482000},"score":1,"resource":{"primary":{"URL":"https:\/\/asmp-eurasipjournals.springeropen.com\/articles\/10.1186\/s13636-025-00400-x"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,3,10]]},"references-count":60,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["400"],"URL":"https:\/\/doi.org\/10.1186\/s13636-025-00400-x","relation":{},"ISSN":["1687-4722"],"issn-type":[{"value":"1687-4722","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,3,10]]},"assertion":[{"value":"20 December 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"23 February 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"10 March 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The ethics application for the subjective test was approved by the Queen Mary Ethics of Research Committee [reference number QMERC20.565.DSEECS23.009.]. All participants provided informed consent prior to their participation in the study.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"The authors declare that they have no competing interests.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"12"}}