{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,12]],"date-time":"2026-06-12T16:47:43Z","timestamp":1781282863725,"version":"3.54.1"},"reference-count":33,"publisher":"SAGE Publications","issue":"6","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["IDA"],"published-print":{"date-parts":[[2023,11,20]]},"abstract":"<jats:p>Recognizing a singing melody from an audio signal in terms of the music notes\u2019 pitch onset and offset, referred to as note-level singing melody transcription, has been studied as a critical task in the field of automatic music transcription. The task is challenging due to the different timbre and vibrato of each vocal and the ambiguity of onset and offset of the human voice compared with other instrumental sounds. This paper proposes a note-level singing melody transcription model using sequence-to-sequence Transformers. The singing melody annotation is expressed as a monophonic melody sequence and used as a decoder sequence. Overlapping decoding is introduced to solve the problem of the context between segments being broken. Applying pitch augmentation and adding a noisy dataset with data cleansing turns out to be effective in preventing overfitting and generalizing the model performance. Ablation studies demonstrate the effects of the proposed techniques in note-level singing melody transcription, both quantitatively and qualitatively. The proposed model outperforms other models in note-level singing melody transcription performance for all the metrics considered. For fundamental frequency metrics, the voice detection performance of the proposed model is comparable to that of a vocal melody extraction model. 
Finally, subjective human evaluation demonstrates that the results of the proposed models are perceived as more accurate than the results of a previous study.<\/jats:p>","DOI":"10.3233\/ida-227077","type":"journal-article","created":{"date-parts":[[2023,11,7]],"date-time":"2023-11-07T13:08:34Z","timestamp":1699362514000},"page":"1853-1871","source":"Crossref","is-referenced-by-count":3,"title":["Note-level singing melody transcription with transformers"],"prefix":"10.1177","volume":"27","author":[{"given":"Jonggwon","family":"Park","sequence":"first","affiliation":[{"name":"Department of Industrial Engineering and Institute for Industrial Systems Innovation, Seoul National University, Seoul, Korea"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Kyoyun","family":"Choi","sequence":"additional","affiliation":[{"name":"Institute of Engineering Research, Seoul National University, Seoul, Korea"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Seola","family":"Oh","sequence":"additional","affiliation":[{"name":"Department of Industrial Engineering and Institute for Industrial Systems Innovation, Seoul National University, Seoul, Korea"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Leekyung","family":"Kim","sequence":"additional","affiliation":[{"name":"Department of Industrial Engineering and Institute for Industrial Systems Innovation, Seoul National University, Seoul, Korea"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jonghun","family":"Park","sequence":"additional","affiliation":[{"name":"Department of Industrial Engineering and Institute for Industrial Systems Innovation, Seoul National University, Seoul, Korea"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"179","reference":[{"key":"10.3233\/IDA-227077_ref1","unstructured":"E. Molina, A.M. Barbancho, L.J. Tard\u00f3n and I. 
Barbancho, Evaluation framework for automatic singing transcription, in: Proceedings of International Society for Musical Information Retrieval (ISMIR), Taipei, Taiwan, 2014, pp.\u00a0567\u2013572."},{"key":"10.3233\/IDA-227077_ref2","unstructured":"M. Goto, AIST Annotation for the RWC Music Database, in: Proceedings of International Society for Musical Information Retrieval (ISMIR), Victoria, Canada, 2006, pp.\u00a0359\u2013360."},{"issue":"2","key":"10.3233\/IDA-227077_ref3","doi-asserted-by":"crossref","first-page":"73","DOI":"10.1162\/COMJ_a_00180","article-title":"Towards computer-assisted flamenco transcription: An experimental comparison of automatic transcription algorithms as applied to a cappella singing","volume":"37","author":"G\u00f3mez","year":"2013","journal-title":"Computer Music Journal"},{"key":"10.3233\/IDA-227077_ref4","unstructured":"G. Meseguer-Brocal, A. Cohen-Hadria and G. Peeters, DALI: a large Dataset of synchronized Audio, LyrIcs and notes, automatically created using teacher-student machine learning paradigm, in: Proceedings of International Society for Musical Information Retrieval (ISMIR), Paris, France, 2018, pp.\u00a0431\u2013437."},{"key":"10.3233\/IDA-227077_ref5","unstructured":"G. Meseguer-Brocal, R. Bittner, S. Durand and B. Brost, Data Cleansing with Contrastive Learning for Vocal Note Event Annotations, in: Proceedings of International Society for Musical Information Retrieval (ISMIR), Montr\u00e9al, Canada, 2020, pp.\u00a0255\u2013262."},{"key":"10.3233\/IDA-227077_ref6","doi-asserted-by":"crossref","unstructured":"J.-Y. Wang and J.-S.R. Jang, On the Preparation and Validation of a Large-Scale Dataset of Singing Transcription, in: Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Toronto, Canada, 2021, pp.\u00a0276\u2013280.","DOI":"10.1109\/ICASSP39728.2021.9414601"},{"key":"10.3233\/IDA-227077_ref7","doi-asserted-by":"crossref","unstructured":"M. Mauch and S. 
Dixon, PYIN: A fundamental frequency estimator using probabilistic threshold distributions, in: Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, 2014, pp.\u00a0659\u2013663.","DOI":"10.1109\/ICASSP.2014.6853678"},{"issue":"4","key":"10.3233\/IDA-227077_ref8","doi-asserted-by":"crossref","first-page":"1917","DOI":"10.1121\/1.1458024","article-title":"YIN, a fundamental frequency estimator for speech and music","volume":"111","author":"de Cheveigne","year":"2002","journal-title":"The Journal of the Acoustical Society of America"},{"key":"10.3233\/IDA-227077_ref9","doi-asserted-by":"crossref","unstructured":"J. Kim, J. Salamon, P. Li and J.P. Bello, CREPE: A Convolutional Representation for Pitch Estimation, in: Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, 2018, pp.\u00a0161\u2013165.","DOI":"10.1109\/ICASSP.2018.8461329"},{"key":"10.3233\/IDA-227077_ref10","doi-asserted-by":"crossref","first-page":"1118","DOI":"10.1109\/TASLP.2020.2982285","article-title":"SPICE: Self-Supervised Pitch Estimation","volume":"28","author":"Gfeller","year":"2020","journal-title":"IEEE\/ACM Transactions on Audio, Speech, and Language Processing"},{"key":"10.3233\/IDA-227077_ref11","doi-asserted-by":"crossref","unstructured":"S. Singh, R. Wang and Y. Qiu, DeepF0: End-To-End Fundamental Frequency Estimation for Music and Speech Signals, in: Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, 2021, pp.\u00a061\u201365.","DOI":"10.1109\/ICASSP39728.2021.9414050"},{"issue":"7","key":"10.3233\/IDA-227077_ref12","doi-asserted-by":"crossref","first-page":"1324","DOI":"10.3390\/app9071324","article-title":"Joint detection and classification of singing voice melody using convolutional recurrent neural networks","volume":"9","author":"Kum","year":"2019","journal-title":"Applied Sciences"},{"key":"10.3233\/IDA-227077_ref13","unstructured":"M. 
Mauch, C. Cannam, R. Bittner, G. Fazekas, J. Salamon, J. Dai, J. Bello and S. Dixon, Computer-aided melody note transcription using the tony software: Accuracy and efficiency, in: Proceedings of the First International Conference on Technologies for Music Notation and Representation (TENOR), 2015."},{"key":"10.3233\/IDA-227077_ref14","unstructured":"Z.-S. Fu and L. Su, Hierarchical Classification Networks for Singing Voice Segmentation and Transcription, in: Proceedings of International Society for Musical Information Retrieval (ISMIR), Paris, France, 2018, pp.\u00a0900\u2013907."},{"key":"10.3233\/IDA-227077_ref15","doi-asserted-by":"crossref","unstructured":"S. Kum, J. Lee, K.L. Kim, T. Kim and J. Nam, Pseudo-Label Transfer from Frame-Level to Note-Level in a Teacher-Student Framework for Singing Transcription from Polyphonic Music, in: Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, 2022, pp.\u00a0796\u2013800.","DOI":"10.1109\/ICASSP43922.2022.9747147"},{"key":"10.3233\/IDA-227077_ref16","doi-asserted-by":"crossref","unstructured":"B. McFee, C. Raffel, D. Liang, D. Ellis, M. McVicar, E. Battenberg and O. Nieto, librosa: Audio and music signal analysis in python, in: Proceedings of the 14th Python in Science Conference (SciPy 2015), 2015, pp.\u00a018\u201325.","DOI":"10.25080\/Majora-7b98e3ed-003"},{"key":"10.3233\/IDA-227077_ref17","unstructured":"C. Hawthorne, I. Simon, R. Swavely, E. Manilow and J. Engel, Sequence-to-Sequence Piano Transcription with Transformers, in: Proceedings of International Society for Musical Information Retrieval (ISMIR), Online, 2021, pp.\u00a0246\u2013253."},{"key":"10.3233\/IDA-227077_ref18","unstructured":"A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser and I. 
Polosukhin, Attention is all you need, in: Proceedings of International Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 2017."},{"issue":"6","key":"10.3233\/IDA-227077_ref19","doi-asserted-by":"crossref","first-page":"1554","DOI":"10.1214\/aoms\/1177699147","article-title":"Statistical Inference for Probabilistic Functions of Finite State Markov Chains","volume":"37","author":"Baum","year":"1966","journal-title":"The Annals of Mathematical Statistics"},{"key":"10.3233\/IDA-227077_ref21","doi-asserted-by":"crossref","unstructured":"L. Su, Vocal melody extraction using patch-based CNN, in: Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, 2018, pp.\u00a0371\u2013375.","DOI":"10.1109\/ICASSP.2018.8462420"},{"key":"10.3233\/IDA-227077_ref22","doi-asserted-by":"crossref","first-page":"186126","DOI":"10.1109\/ACCESS.2019.2960566","article-title":"Shakedrop regularization for deep residual learning","volume":"7","author":"Yamada","year":"2019","journal-title":"IEEE Access"},{"issue":"8","key":"10.3233\/IDA-227077_ref23","doi-asserted-by":"crossref","first-page":"1979","DOI":"10.1109\/TPAMI.2018.2858821","article-title":"Virtual adversarial training: A regularization method for supervised and semi-supervised learning","volume":"41","author":"Miyato","year":"2018","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"10.3233\/IDA-227077_ref24","unstructured":"M. Tan and Q.V. LE, EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, in: Proceedings of International Conference on Machine Learning (ICML), 2019, pp.\u00a06105\u20136114."},{"key":"10.3233\/IDA-227077_ref25","unstructured":"S. Kum, J.H. Lin, L. Su and J. 
Nam, Semi-supervised learning using teacher-student models for vocal melody extraction, in: Proceedings of International Society for Musical Information Retrieval (ISMIR), Montr\u00e9al, Canada, 2020, pp.\u00a093\u2013100."},{"key":"10.3233\/IDA-227077_ref26","unstructured":"C. Donahue and P. Liang, SHEET SAGE: LEAD SHEETS FROM MUSIC AUDIO, in: Extended Abstracts for the Late-Breaking Demo Session of the International Society for Music Information Retrieval Conference, Online, 2021."},{"issue":"140","key":"10.3233\/IDA-227077_ref29","first-page":"1","article-title":"Exploring the limits of transfer learning with a unified text-to-text Transformer","volume":"21","author":"Raffel","year":"2020","journal-title":"Journal of Machine Learning Research"},{"key":"10.3233\/IDA-227077_ref31","unstructured":"R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang and T.-Y. Liu, On Layer Normalization in the Transformer Architecture, in: Proceedings of International Conference on Machine Learning (ICML), Online, 2020, p.\u00a0PMLR 119."},{"issue":"1","key":"10.3233\/IDA-227077_ref32","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s40537-019-0197-0","article-title":"A survey on image data augmentation for deep learning","volume":"6","author":"Shorten","year":"2019","journal-title":"Journal of Big Data"},{"key":"10.3233\/IDA-227077_ref33","doi-asserted-by":"crossref","unstructured":"R.L. Aguiar, Y.M.G. Costa and C.N. Silla, Exploring Data Augmentation to Improve Music Genre Classification with ConvNets, in: 2018 International Joint Conference on Neural Networks (IJCNN), 2018, pp.\u00a01\u20138.","DOI":"10.1109\/IJCNN.2018.8489166"},{"key":"10.3233\/IDA-227077_ref34","unstructured":"R.M. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam and J.P. 
Bello, Medleydb: A multitrack dataset for annotation-intensive mir research, in: Proceedings of International Society for Musical Information Retrieval (ISMIR), 2014, pp.\u00a0155\u2013160."},{"key":"10.3233\/IDA-227077_ref35","unstructured":"R.M. Bittner, B. McFee, J. Salamon, P. Li and J.P. Bello, Deep salience representations for F0 estimation in polyphonic music, in: Proceedings of International Society for Musical Information Retrieval (ISMIR), Suzhou, China, 2017, pp.\u00a063\u201370."},{"key":"10.3233\/IDA-227077_ref36","unstructured":"C. Raffel, B. McFee, E.J. Humphrey, J. Salamon, O. Nieto, D. Liang and D.P.W. Ellis, Mir_eval: A transparent implementation of common mir metrics, in: Proceedings of International Society for Musical Information Retrieval (ISMIR), Taipei, Taiwan, 2014, pp.\u00a0367\u2013372."},{"issue":"50","key":"10.3233\/IDA-227077_ref37","doi-asserted-by":"crossref","first-page":"2154","DOI":"10.21105\/joss.02154","article-title":"Spleeter: A fast and efficient music source separation tool with pre-trained models","volume":"5","author":"Hennequin","year":"2020","journal-title":"Journal of Open Source Software"}],"container-title":["Intelligent Data Analysis"],"original-title":[],"link":[{"URL":"https:\/\/content.iospress.com\/download?id=10.3233\/IDA-227077","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T09:20:17Z","timestamp":1777454417000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/full\/10.3233\/IDA-227077"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,11,20]]},"references-count":33,"journal-issue":{"issue":"6"},"URL":"https:\/\/doi.org\/10.3233\/ida-227077","relation":{},"ISSN":["1088-467X","1571-4128"],"issn-type":[{"value":"1088-467X","type":"print"},{"value":"1571-4128","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,11,20]]}}}