{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,18]],"date-time":"2026-06-18T10:39:33Z","timestamp":1781779173042,"version":"3.54.5"},"reference-count":80,"publisher":"Association for Computing Machinery (ACM)","issue":"7","license":[{"start":{"date-parts":[[2024,5,16]],"date-time":"2024-05-16T00:00:00Z","timestamp":1715817600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001459","name":"Ministry of Education in Singapore","doi-asserted-by":"crossref","award":["MOE-T2EP20120-0012"],"award-info":[{"award-number":["MOE-T2EP20120-0012"]}],"id":[{"id":"10.13039\/501100001459","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2024,7,31]]},"abstract":"<jats:p>Automatic lyric transcription (ALT) refers to transcribing singing voices into lyrics, while automatic music transcription (AMT) refers to transcribing singing voices into note events, i.e., musical MIDI notes. Despite these two tasks having significant potential for practical application, they are still nascent. This is because the transcription of lyrics and note events solely from singing audio is notoriously difficult due to the presence of noise contamination, e.g., musical accompaniment, resulting in a degradation of both the intelligibility of sung lyrics and the recognizability of sung notes. To address this challenge, we propose a general framework for implementing multimodal ALT and AMT systems. Additionally, we curate the first multimodal singing dataset, comprising N20EMv1 and N20EMv2, which encompasses audio recordings and videos of lip movements, together with ground truth for lyrics and note events. 
For model construction, we propose adapting self-supervised learning models from the speech domain as acoustic encoders and visual encoders to alleviate the scarcity of labeled data. We also introduce a residual cross-attention mechanism to effectively integrate features from the audio and video modalities. Through extensive experiments, we demonstrate that our single-modal systems exhibit state-of-the-art performance on both ALT and AMT tasks. Subsequently, through single-modal experiments, we also explore the individual contributions of each modality to the multimodal system. Finally, we combine these and demonstrate the effectiveness of our proposed multimodal systems, particularly in terms of their noise robustness.<\/jats:p>","DOI":"10.1145\/3651310","type":"journal-article","created":{"date-parts":[[2024,3,12]],"date-time":"2024-03-12T12:04:28Z","timestamp":1710245068000},"page":"1-29","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":10,"title":["Automatic Lyric Transcription and Automatic Music Transcription from Multimodal Singing"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0637-8664","authenticated-orcid":false,"given":"Xiangming","family":"Gu","sequence":"first","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1725-8361","authenticated-orcid":false,"given":"Longshen","family":"Ou","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore Singapore"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0953-5314","authenticated-orcid":false,"given":"Wei","family":"Zeng","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore 
Singapore"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4969-1880","authenticated-orcid":false,"given":"Jianan","family":"Zhang","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore Singapore"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-6473-6938","authenticated-orcid":false,"given":"Nicholas","family":"Wong","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore Singapore"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0123-1260","authenticated-orcid":false,"given":"Ye","family":"Wang","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore Singapore"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2024,5,16]]},"reference":[{"key":"e_1_3_4_2_2","article-title":"LRS3-TED: A large-scale dataset for visual speech recognition","author":"Afouras Triantafyllos","year":"2018","unstructured":"Triantafyllos Afouras, Joon Son Chung, and Andrew Zisserman. 2018. LRS3-TED: A large-scale dataset for visual speech recognition. arXiv preprint arXiv:1809.00496 (2018).","journal-title":"arXiv preprint arXiv:1809.00496"},{"key":"e_1_3_4_3_2","first-page":"4603","volume-title":"2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201922)","author":"Arroyo V\u00edctor","year":"2022","unstructured":"V\u00edctor Arroyo, Jose J. Valero-Mas, Jorge Calvo-Zaragoza, and Antonio Pertusa. 2022. Neural audio-to-score music transcription for unconstrained polyphony using compact output representations. In 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201922). 
IEEE, 4603\u20134607."},{"key":"e_1_3_4_4_2","first-page":"12449","article-title":"wav2vec 2.0: A framework for self-supervised learning of speech representations","volume":"33","author":"Baevski Alexei","year":"2020","unstructured":"Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems 33 (2020), 12449\u201312460.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_4_5_2","first-page":"266","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Basak Sakya","year":"2021","unstructured":"Sakya Basak, Shrutina Agarwal, Sriram Ganapathy, and Naoya Takahashi. 2021. End-to-end lyrics recognition with voice to singing style transfer. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 266\u2013270."},{"key":"e_1_3_4_6_2","first-page":"621","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Chen Ke","year":"2022","unstructured":"Ke Chen, Shuai Yu, Cheng-i Wang, Wei Li, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. 2022. Tonet: Tone-octave network for singing melody extraction from polyphonic music. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 621\u2013625."},{"key":"e_1_3_4_7_2","unstructured":"Jan K. Chorowski Dzmitry Bahdanau Dmitriy Serdyuk Kyunghyun Cho and Yoshua Bengio. 2015. Attention-based models for speech recognition. Advances in Neural Information Processing Systems 28 (2015). 577\u2013585."},{"key":"e_1_3_4_8_2","first-page":"1086","article-title":"VoxCeleb2: Deep speaker recognition","author":"Chung Joon Son","year":"2018","unstructured":"Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. 2018. VoxCeleb2: Deep speaker recognition. In Proceedings of Interspeech. 
1086\u20131090.","journal-title":"Proceedings of Interspeech"},{"key":"e_1_3_4_9_2","first-page":"579","volume-title":"Proceedings of Interspeech","author":"Dabike Gerardo Roa","year":"2019","unstructured":"Gerardo Roa Dabike and Jon Barker. 2019. Automatic lyric transcription from karaoke vocal tracks: Resources and a baseline system. In Proceedings of Interspeech. 579\u2013583."},{"issue":"2","key":"e_1_3_4_10_2","doi-asserted-by":"crossref","first-page":"164","DOI":"10.1080\/13682820801997189","article-title":"Investigating the psycholinguistic correlates of speechreading in preschool age children","volume":"44","author":"Davies Rebecca","year":"2009","unstructured":"Rebecca Davies, Evan Kidd, and Karen Lander. 2009. Investigating the psycholinguistic correlates of speechreading in preschool age children. International Journal of Language & Communication Disorders 44, 2 (2009), 164\u2013174.","journal-title":"International Journal of Language & Communication Disorders"},{"key":"e_1_3_4_11_2","doi-asserted-by":"publisher","DOI":"10.1121\/1.1458024"},{"key":"e_1_3_4_12_2","volume-title":"Proceedings of the ISMIR 2021 Workshop on Music Source Separation","author":"D\u00e9fossez Alexandre","year":"2021","unstructured":"Alexandre D\u00e9fossez. 2021. Hybrid spectrogram and waveform source separation. In Proceedings of the ISMIR 2021 Workshop on Music Source Separation."},{"key":"e_1_3_4_13_2","article-title":"Music source separation in the waveform domain","author":"D\u00e9fossez Alexandre","year":"2019","unstructured":"Alexandre D\u00e9fossez, Nicolas Usunier, L\u00e9on Bottou, and Francis Bach. 2019. Music source separation in the waveform domain. arXiv preprint arXiv:1911.13254 (2019).","journal-title":"arXiv preprint arXiv:1911.13254"},{"key":"e_1_3_4_14_2","first-page":"1","volume-title":"International Joint Conference on Neural Networks (IJCNN)","author":"Demirel Emir","year":"2020","unstructured":"Emir Demirel, Sven Ahlb\u00e4ck, and Simon Dixon. 2020. 
Automatic lyrics transcription using dilated convolutional neural networks with self-attention. In International Joint Conference on Neural Networks (IJCNN). IEEE, 1\u20138."},{"key":"e_1_3_4_15_2","first-page":"586","volume-title":"2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201921)","author":"Demirel Emir","year":"2021","unstructured":"Emir Demirel, Sven Ahlb\u00e4ck, and Simon Dixon. 2021. Low resource audio-to-lyrics alignment from polyphonic music recordings. In 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201921). IEEE, 586\u2013590."},{"key":"e_1_3_4_16_2","first-page":"151","volume-title":"Proceedings of the 22nd International Society for Music Information Retrieval Conference (ISMIR)","author":"Demirel Emir","year":"2021","unstructured":"Emir Demirel, Sven Ahlb\u00e4ck, and Simon Dixon. 2021. MSTRE-Net: Multistreaming acoustic modeling for automatic lyrics transcription. In Proceedings of the 22nd International Society for Music Information Retrieval Conference (ISMIR). 151\u2013158."},{"key":"e_1_3_4_17_2","unstructured":"Tengyu Deng Eita Nakamura and Kazuyoshi Yoshii. 2022. End-to-end lyrics transcription informed by pitch and onset estimation. In The 23rd International Society for Music Information Retrieval Conference (ISMIR). 633\u2013639."},{"key":"e_1_3_4_18_2","first-page":"1","volume-title":"Asia-Pacific Signal and Information Processing Association Annual Summit and Conference","author":"Duan Zhiyan","year":"2013","unstructured":"Zhiyan Duan, Haotian Fang, Bo Li, Khe Chai Sim, and Ye Wang. 2013. The NUS sung and spoken lyrics corpus: A quantitative comparison of singing and speech. In Asia-Pacific Signal and Information Processing Association Annual Summit and Conference. IEEE, 1\u20139."},{"key":"e_1_3_4_19_2","unstructured":"G. Dzhambazov. 2017. Knowledge-based probabilistic modeling for tracking lyrics in music audio signals. Ph. D. Dissertation. 
Universitat Pompeu Fabra."},{"key":"e_1_3_4_20_2","first-page":"900","volume-title":"Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR)","author":"Fu Zih-Sing","year":"2019","unstructured":"Zih-Sing Fu and Li Su. 2019. Hierarchical classification networks for singing voice segmentation and transcription. In Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR). 900\u2013907."},{"key":"e_1_3_4_21_2","first-page":"281","volume-title":"Proceedings of the 9th International Society for Music Information Retrieval Conference (ISMIR)","author":"Fujihara Hiromasa","year":"2008","unstructured":"Hiromasa Fujihara, Masataka Goto, and Jun Ogata. 2008. Hyperlinking lyrics: A method for creating hyperlinks between phrases in song lyrics. In Proceedings of the 9th International Society for Music Information Retrieval Conference (ISMIR). 281\u2013286."},{"key":"e_1_3_4_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2022.3190742"},{"key":"e_1_3_4_23_2","first-page":"791","volume-title":"2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201922)","author":"Gao Xiaoxue","year":"2022","unstructured":"Xiaoxue Gao, Chitralekha Gupta, and Haizhou Li. 2022. Genre-conditioned acoustic models for automatic lyrics transcription of polyphonic music. In 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201922). IEEE, 791\u2013795."},{"key":"e_1_3_4_24_2","first-page":"1","volume-title":"2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201923)","author":"Gao Xiaoxue","year":"2023","unstructured":"Xiaoxue Gao, Xianghu Yue, and Haizhou Li. 2023. Self-Transcriber: Few-shot lyrics transcription with self-training. In 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201923). 
IEEE, 1\u20135."},{"key":"e_1_3_4_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2020.2982285"},{"key":"e_1_3_4_26_2","doi-asserted-by":"publisher","DOI":"10.1162\/COMJ_a_00180"},{"key":"e_1_3_4_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2013.2295918"},{"key":"e_1_3_4_28_2","first-page":"3328","volume-title":"Proceedings of the 30th ACM International Conference on Multimedia","author":"Gu Xiangming","year":"2022","unstructured":"Xiangming Gu, Longshen Ou, Danielle Ong, and Ye Wang. 2022. MM-ALT: A multimodal automatic lyric transcription system. In Proceedings of the 30th ACM International Conference on Multimedia. 3328\u20133337."},{"key":"e_1_3_4_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3612272"},{"key":"e_1_3_4_30_2","first-page":"496","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Gupta Chitralekha","year":"2020","unstructured":"Chitralekha Gupta, Emre Y\u0131lmaz, and Haizhou Li. 2020. Automatic lyrics alignment and transcription in polyphonic music: Does background music help? In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 496\u2013500."},{"key":"e_1_3_4_31_2","first-page":"494","volume-title":"9th Sound and Music Computing Conference (SMC)","author":"Hansen Jens Kofod","year":"2012","unstructured":"Jens Kofod Hansen and I. D. M. T. Fraunhofer. 2012. Recognition of phonemes in a-cappella recordings using temporal patterns and mel frequency cepstral coefficients. In 9th Sound and Music Computing Conference (SMC). 494\u2013499."},{"key":"e_1_3_4_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_4_33_2","unstructured":"Toru Hosoya Motoyuki Suzuki Akinori Ito Shozo Makino Lloyd A. Smith David Bainbridge and Ian H. Witten. 2005. Lyrics recognition from a singing voice based on finite state automaton for music information retrieval. 
In Proceedings of the 6th International Society for Music Information Retrieval Conference (ISMIR). 532\u2013535."},{"key":"e_1_3_4_34_2","first-page":"293","volume-title":"Proceedings of the 22nd International Society for Music Information Retrieval Conference (ISMIR)","author":"Hsu Jui-Yang","year":"2021","unstructured":"Jui-Yang Hsu and Li Su. 2021. VOCANO: A note transcription framework for singing voice in polyphonic music. In Proceedings of the 22nd International Society for Music Information Retrieval Conference (ISMIR). 293\u2013300."},{"key":"e_1_3_4_35_2","first-page":"1440","volume-title":"IEEE International Conference on Image Processing (ICIP)","author":"Hu Xinxin","year":"2019","unstructured":"Xinxin Hu, Kailun Yang, Lei Fei, and Kaiwei Wang. 2019. Acnet: Attention based network to exploit complementary features for RGBD semantic segmentation. In IEEE International Conference on Image Processing (ICIP). IEEE, 1440\u20131444."},{"issue":"1","key":"e_1_3_4_36_2","doi-asserted-by":"crossref","first-page":"99","DOI":"10.1109\/TASL.2012.2215589","article-title":"Pitch estimation in noisy speech using accumulated peak spectrum and sparse estimation technique","volume":"21","author":"Huang Feng","year":"2012","unstructured":"Feng Huang and Tan Lee. 2012. Pitch estimation in noisy speech using accumulated peak spectrum and sparse estimation technique. IEEE Transactions on Audio, Speech, and Language Processing 21, 1 (2012), 99\u2013109.","journal-title":"IEEE Transactions on Audio, Speech, and Language Processing"},{"key":"e_1_3_4_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3547854"},{"key":"e_1_3_4_38_2","unstructured":"Yu Huang Chenzhuang Du Zihui Xue Xuanyao Chen Hang Zhao and Longbo Huang. 2021. What makes multi-modal learning better than single (provably). Advances in Neural Information Processing Systems 34 (2021). 
10944\u201310956."},{"key":"e_1_3_4_39_2","first-page":"161","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Kim Jong Wook","year":"2018","unstructured":"Jong Wook Kim, Justin Salamon, Peter Li, and Juan Pablo Bello. 2018. Crepe: A convolutional representation for pitch estimation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 161\u2013165."},{"key":"e_1_3_4_40_2","article-title":"Adam: A method for stochastic optimization","author":"Kingma Diederik P.","year":"2014","unstructured":"Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).","journal-title":"arXiv preprint arXiv:1412.6980"},{"key":"e_1_3_4_41_2","first-page":"336","volume-title":"Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR)","author":"Kruspe Anna M.","year":"2015","unstructured":"Anna M. Kruspe. 2015. Training phoneme models for singing with \u201dsongified\u201d speech data. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR). 336\u2013342."},{"key":"e_1_3_4_42_2","first-page":"796","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Kum Sangeun","year":"2022","unstructured":"Sangeun Kum, Jongpil Lee, Keunhyoung Luke Kim, Taehyoung Kim, and Juhan Nam. 2022. Pseudo-label transfer from frame-level to note-level in a teacher-student framework for singing transcription from polyphonic music. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 796\u2013800."},{"key":"e_1_3_4_43_2","volume-title":"International Conference on Learning Representations","author":"Kumar Ananya","year":"2022","unstructured":"Ananya Kumar, Aditi Raghunathan, Robbie Matthew Jones, Tengyu Ma, and Percy Liang. 2022. 
Fine-tuning can distort pretrained features and underperform out-of-distribution. In International Conference on Learning Representations."},{"key":"e_1_3_4_44_2","article-title":"MERT: Acoustic music understanding model with large-scale self-supervised training","author":"Li Yizhi","year":"2023","unstructured":"Yizhi Li, Ruibin Yuan, Ge Zhang, Yinghao Ma, Xingran Chen, Hanzhi Yin, Chenghua Lin, Anton Ragni, Emmanouil Benetos, Norbert Gyenge, et\u00a0al. 2023. MERT: Acoustic music understanding model with large-scale self-supervised training. arXiv preprint arXiv:2306.00107 (2023).","journal-title":"arXiv preprint arXiv:2306.00107"},{"key":"e_1_3_4_45_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v36i10.21350"},{"key":"e_1_3_4_46_2","first-page":"7613","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Ma Pingchuan","year":"2021","unstructured":"Pingchuan Ma, Stavros Petridis, and Maja Pantic. 2021. End-to-end audio-visual speech recognition with conformers. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 7613\u20137617."},{"key":"e_1_3_4_47_2","unstructured":"Matthias Mauch Chris Cannam Rachel Bittner George Fazekas Justin Salamon Jiajie Dai Juan Bello and Simon Dixon. 2015. Computer-aided melody note transcription using the tony software: accuracy and efficiency. In Proceedings of International Conference on Technologies for Music Notation and Representation (TENOR). 23\u201330."},{"key":"e_1_3_4_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2014.6853678"},{"issue":"1","key":"e_1_3_4_49_2","doi-asserted-by":"crossref","first-page":"200","DOI":"10.1109\/TASL.2011.2159595","article-title":"Integrating additional chord information into HMM-based lyrics-to-audio alignment","volume":"20","author":"Mauch Matthias","year":"2011","unstructured":"Matthias Mauch, Hiromasa Fujihara, and Masataka Goto. 2011. 
Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing 20, 1 (2011), 200\u2013210.","journal-title":"IEEE Transactions on Audio, Speech, and Language Processing"},{"key":"e_1_3_4_50_2","doi-asserted-by":"publisher","DOI":"10.1038\/264746a0"},{"key":"e_1_3_4_51_2","first-page":"3117","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"McVicar Matt","year":"2014","unstructured":"Matt McVicar, Daniel P. W. Ellis, and Masataka Goto. 2014. Leveraging repetition for improved automatic lyric transcription in popular music. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 3117\u20133121."},{"issue":"4312","key":"e_1_3_4_52_2","doi-asserted-by":"crossref","first-page":"75","DOI":"10.1126\/science.198.4312.75","article-title":"Imitation of facial and manual gestures by human neonates","volume":"198","author":"Meltzoff Andrew N.","year":"1977","unstructured":"Andrew N. Meltzoff and M. Keith Moore. 1977. Imitation of facial and manual gestures by human neonates. Science 198, 4312 (1977), 75\u201378.","journal-title":"Science"},{"key":"e_1_3_4_53_2","first-page":"2146","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Mesaros Annamaria","year":"2010","unstructured":"Annamaria Mesaros and Tuomas Virtanen. 2010. Recognition of phonemes and words in singing. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2146\u20132149."},{"key":"e_1_3_4_54_2","article-title":"Dali: A large dataset of synchronized audio, lyrics and notes, automatically created using teacher-student machine learning paradigm","author":"Meseguer-Brocal Gabriel","year":"2019","unstructured":"Gabriel Meseguer-Brocal, Alice Cohen-Hadria, and Geoffroy Peeters. 2019. 
Dali: A large dataset of synchronized audio, lyrics and notes, automatically created using teacher-student machine learning paradigm. arXiv preprint arXiv:1906.10606 (2019).","journal-title":"arXiv preprint arXiv:1906.10606"},{"key":"e_1_3_4_55_2","doi-asserted-by":"crossref","unstructured":"Gabriel Meseguer-Brocal Alice Cohen-Hadria and Geoffroy Peeters. 2020. Creating DALI a large dataset of synchronized audio lyrics and notes. Transactions of the International Society for Music Information Retrieval 3 1 (2020). 55\u201367.","DOI":"10.5334\/tismir.30"},{"key":"e_1_3_4_56_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2018.2858821"},{"key":"e_1_3_4_57_2","unstructured":"Emilio Molina Ana Maria Barbancho-Perez Lorenzo Jose Tardon-Garcia and Isabel Barbancho-Perez. 2014. Evaluation framework for automatic singing transcription. In Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR). 567\u2013572."},{"key":"e_1_3_4_58_2","volume-title":"Dagstuhl Reports","author":"M\u00fcller Meinard","year":"2019","unstructured":"Meinard M\u00fcller, Emilia G\u00f3mez, and Yi-Hsun Yang. 2019. Computational methods for melody and voice processing in music recordings (Dagstuhl seminar 19052). In Dagstuhl Reports, Vol. 9. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik."},{"key":"e_1_3_4_59_2","doi-asserted-by":"crossref","first-page":"1679","DOI":"10.1145\/3240508.3240691","volume-title":"Proceedings of the 26th ACM International Conference on Multimedia","author":"Murad Dania","year":"2018","unstructured":"Dania Murad, Riwu Wang, Douglas Turnbull, and Ye Wang. 2018. SLIONS: A karaoke application to enhance foreign language learning. In Proceedings of the 26th ACM International Conference on Multimedia. 1679\u20131687."},{"key":"e_1_3_4_60_2","doi-asserted-by":"publisher","DOI":"10.1017\/ATSIP.2021.4"},{"key":"e_1_3_4_61_2","unstructured":"Longshen Ou Xiangming Gu and Ye Wang. 2022. 
Transfer learning of wav2vec 2.0 for automatic lyric transcription. In Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR). 891\u2013899."},{"key":"e_1_3_4_62_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.acl-long.308"},{"key":"e_1_3_4_63_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2015.7178964"},{"key":"e_1_3_4_64_2","unstructured":"Colin Raffel Brian McFee Eric J. Humphrey Justin Salamon Oriol Nieto Dawen Liang Daniel P. W. Ellis and C. Colin Raffel. 2014. mir_eval: A transparent implementation of common MIR metrics. In Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR). 367\u2013372."},{"key":"e_1_3_4_65_2","first-page":"34","volume-title":"ISMIR","author":"Rom\u00e1n Miguel A.","year":"2018","unstructured":"Miguel A. Rom\u00e1n, Antonio Pertusa, and Jorge Calvo-Zaragoza. 2018. An end-to-end framework for audio-to-score music transcription on monophonic excerpts. In ISMIR. 34\u201341."},{"key":"e_1_3_4_66_2","doi-asserted-by":"crossref","first-page":"13525","DOI":"10.1109\/ICRA48506.2021.9561675","volume-title":"IEEE International Conference on Robotics and Automation (ICRA)","author":"Seichter Daniel","year":"2021","unstructured":"Daniel Seichter, Mona K\u00f6hler, Benjamin Lewandowski, Tim Wengefeld, and Horst-Michael Gross. 2021. Efficient RGB-D semantic segmentation for indoor scene analysis. In IEEE International Conference on Robotics and Automation (ICRA). IEEE, 13525\u201313531."},{"key":"e_1_3_4_67_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2019.2955253"},{"key":"e_1_3_4_68_2","volume-title":"International Conference on Learning Representations","author":"Shi Bowen","year":"2022","unstructured":"Bowen Shi, Wei-Ning Hsu, Kushal Lakhotia, and Abdelrahman Mohamed. 2022. Learning audio-visual speech representation by masked multimodal cluster prediction. 
In International Conference on Learning Representations."},{"key":"e_1_3_4_69_2","article-title":"Robust self-supervised audio-visual speech recognition","author":"Shi Bowen","year":"2022","unstructured":"Bowen Shi, Wei-Ning Hsu, and Abdelrahman Mohamed. 2022. Robust self-supervised audio-visual speech recognition. arXiv preprint arXiv:2201.01763 (2022).","journal-title":"arXiv preprint arXiv:2201.01763"},{"key":"e_1_3_4_70_2","first-page":"181","volume-title":"2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201919)","author":"Stoller Daniel","year":"2019","unstructured":"Daniel Stoller, Simon Durand, and Sebastian Ewert. 2019. End-to-end lyrics alignment for polyphonic music using an audio-to-character recognition model. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201919). IEEE, 181\u2013185."},{"key":"e_1_3_4_71_2","first-page":"371","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Su Li","year":"2018","unstructured":"Li Su. 2018. Vocal melody extraction using patch-based CNN. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 371\u2013375."},{"issue":"2","key":"e_1_3_4_72_2","doi-asserted-by":"crossref","first-page":"99","DOI":"10.1002\/oti.227","article-title":"Movement-to-music computer technology: A developmental play experience for children with severe physical disabilities","volume":"14","author":"Tam Cynthia","year":"2007","unstructured":"Cynthia Tam, Heidi Schwellnus, Ceilidh Eaton, Yani Hamdani, Andrea Lamont, and Tom Chau. 2007. Movement-to-music computer technology: A developmental play experience for children with severe physical disabilities. 
Occupational Therapy International 14, 2 (2007), 99\u2013112.","journal-title":"Occupational Therapy International"},{"key":"e_1_3_4_73_2","first-page":"6105","volume-title":"International Conference on Machine Learning","author":"Tan Mingxing","year":"2019","unstructured":"Mingxing Tan and Quoc Le. 2019. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning. PMLR, 6105\u20136114."},{"key":"e_1_3_4_74_2","first-page":"3927","volume-title":"Proceedings of the 29th ACM International Conference on Multimedia","author":"Tao Ruijie","year":"2021","unstructured":"Ruijie Tao, Zexu Pan, Rohan Kumar Das, Xinyuan Qian, Mike Zheng Shou, and Haizhou Li. 2021. Is someone speaking? Exploring long-term temporal features for audio-visual active speaker detection. In Proceedings of the 29th ACM International Conference on Multimedia. 3927\u20133935."},{"key":"e_1_3_4_75_2","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N. Gomez \u0141ukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing systems 30 (2017). 5998\u20136008."},{"key":"e_1_3_4_76_2","first-page":"276","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Wang Jun-You","year":"2021","unstructured":"Jun-You Wang and Jyh-Shing Roger Jang. 2021. On the preparation and validation of a large-scale dataset of singing transcription. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 
IEEE, 276\u2013280."},{"key":"e_1_3_4_77_2","doi-asserted-by":"publisher","DOI":"10.1109\/MMUL.2008.49"},{"key":"e_1_3_4_78_2","doi-asserted-by":"publisher","DOI":"10.1109\/JSTSP.2017.2763455"},{"key":"e_1_3_4_79_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01070"},{"key":"e_1_3_4_80_2","doi-asserted-by":"crossref","unstructured":"Weiming Yang Xianke Wang Bowen Tian Wei Xu and Wenqing Cheng. 2022. A multi-stage automatic evaluation system for sight-singing. IEEE Transactions on Multimedia 24 (2022) 3881\u20133893.","DOI":"10.1109\/TMM.2022.3168132"},{"key":"e_1_3_4_81_2","unstructured":"Chen Zhang Jiaxing Yu LuChin Chang Xu Tan Jiawei Chen Tao Qin and Kejun Zhang. 2022. PDAugment: Data augmentation by pitch and duration adjustments for automatic lyrics transcription. In Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR). 454\u2013461."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3651310","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3651310","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T22:49:54Z","timestamp":1750286994000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3651310"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,5,16]]},"references-count":80,"journal-issue":{"issue":"7","published-print":{"date-parts":[[2024,7,31]]}},"alternative-id":["10.1145\/3651310"],"URL":"https:\/\/doi.org\/10.1145\/3651310","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,5,16]]},"asserti
on":[{"value":"2023-07-29","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-02-26","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-05-16","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}