{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,24]],"date-time":"2026-03-24T11:47:58Z","timestamp":1774352878394,"version":"3.50.1"},"reference-count":42,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2019,2,13]],"date-time":"2019-02-13T00:00:00Z","timestamp":1550016000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2019,2,28]]},
"abstract":"<jats:p>Deep cross-modal learning has successfully demonstrated excellent performance in cross-modal multimedia retrieval, with the aim of learning joint representations between different data modalities. Unfortunately, little research focuses on cross-modal correlation learning where temporal structures of different data modalities, such as audio and lyrics, should be taken into account. Stemming from the temporal nature of music, we are motivated to learn the deep sequential correlation between audio and lyrics. In this work, we propose a deep cross-modal correlation learning architecture involving two-branch deep neural networks for the audio modality and the text modality (lyrics). Data in different modalities are converted to the same canonical space, where intermodal canonical correlation analysis is utilized as an objective function to calculate the similarity of temporal structures. This is the first study that uses deep architectures for learning the temporal correlation between audio and lyrics. A pretrained Doc2Vec model followed by fully connected layers is used to represent lyrics. Two significant contributions are made in the audio branch, as follows: (i) We propose an end-to-end network to learn the cross-modal correlation between audio and lyrics, where feature extraction and correlation learning are performed simultaneously and a joint representation is learned by considering temporal structures. (ii) For feature extraction, we further represent an audio signal by a short sequence of local summaries (VGG16 features) and apply a recurrent neural network to compute a compact feature that better captures the temporal structures of music audio. Experimental results, using audio to retrieve lyrics or using lyrics to retrieve audio, verify the effectiveness of the proposed deep correlation learning architectures in cross-modal music retrieval.<\/jats:p>",
"DOI":"10.1145\/3281746","type":"journal-article","created":{"date-parts":[[2019,2,14]],"date-time":"2019-02-14T19:36:17Z","timestamp":1550172977000},"page":"1-16","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":74,"title":["Deep Cross-Modal Correlation Learning for Audio and Lyrics in Music Retrieval"],"prefix":"10.1145","volume":"15",
"author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0294-6620","authenticated-orcid":false,"given":"Yi","family":"Yu","sequence":"first","affiliation":[{"name":"National Institute of Informatics, Chiyoda-ku, Tokyo, Japan"}]},{"given":"Suhua","family":"Tang","sequence":"additional","affiliation":[{"name":"The University of Electro-Communications, Chofu, Tokyo, Japan"}]},{"given":"Francisco","family":"Raposo","sequence":"additional","affiliation":[{"name":"Universidade de Lisboa, Lisboa, Portugal"}]},{"given":"Lei","family":"Chen","sequence":"additional","affiliation":[{"name":"Hong Kong University of Science and Technology, Kowloon, Hong Kong"}]}],"member":"320","published-online":{"date-parts":[[2019,2,13]]},
"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-04114-8_26"},{"key":"e_1_2_1_2_1","volume-title":"Proceedings of the 30th International Conference on International Conference on Machine Learning -","volume":"28","author":"Andrew Galen","year":"2013"},{"key":"e_1_2_1_3_1","volume-title":"Proceedings of Workshop on Artificial Intelligence and Statistics.","author":"Brochu Eric","year":"2003"},{"key":"e_1_2_1_4_1","volume-title":"Proceedings of the 31st AAAI Conference on Artificial Intelligence","author":"Cao Yue","year":"2017"},{"key":"e_1_2_1_5_1","volume-title":"Proceedings of the 17th International Society for Music Information Retrieval Conference, ISMIR 2016, New York City","author":"Choi Keunwoo","year":"2016"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.asoc.2016.12.024"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2007.890831"},{"key":"e_1_2_1_8_1","volume-title":"Proceedings of 14th International Conference on Music Information Retrieval (ISMIR'13)","author":"Hamel Philippe","year":"2013"},{"key":"e_1_2_1_9_1","volume-title":"Multi-view recurrent neural acoustic word embeddings. CoRR abs\/1611.04496","author":"He Wanjia","year":"2016"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2017.7952132"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_2_1_12_1","volume-title":"Deep learning for content-based, cross-modal retrieval of videos and music. CoRR abs\/1704.06761","author":"Hong Sungeun","year":"2017"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1093\/biomet\/28.3-4.321"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.767"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.348"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3065386"},{"key":"e_1_2_1_17_1","volume-title":"An empirical evaluation of doc2vec with practical insights into document embedding generation. CoRR abs\/1607.05368","author":"Lau Jey Han","year":"2016"},{"key":"e_1_2_1_18_1","volume-title":"Proceedings of the 22nd International Conference on Neural Information Processing Systems (NIPS\u201909)","author":"Lee Honglak"},{"key":"e_1_2_1_19_1","volume-title":"Keunhyoung Luke Kim, and Juhan Nam","author":"Lee Jongpil","year":"2017"},{"key":"e_1_2_1_20_1","volume-title":"Proceedings of the European Conference on Computer Vision, Zurich. 740--755","author":"Lin Tsung-Yi"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/TAFFC.2016.2598569"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/P14-5010"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.5555\/2026666.2026708"},{"key":"e_1_2_1_24_1","volume-title":"Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR\u201911)","author":"McVicar Matt","year":"2011"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.5555\/2390948.2391015"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2015.09.018"},{"key":"e_1_2_1_27_1","volume-title":"The origins of music","author":"Nettl Bruno"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/1873951.1873987"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/2647868.2654919"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2014.6854949"},{"key":"e_1_2_1_31_1","volume-title":"Very deep convolutional networks for large-scale image recognition. CoRR abs\/1409.1556","author":"Simonyan Karen","year":"2014"},{"key":"e_1_2_1_32_1","volume-title":"Lyrics-based music genre classification using a hierarchical attention network. CoRR abs\/1707.04678","author":"Tsaptsinos Alexandros","year":"2017"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2016.2557722"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298966"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/1631272.1631320"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/1873951.1874004"},{"key":"e_1_2_1_37_1","volume-title":"Video captioning and retrieval models with semantic attention. CoRR abs\/1610.02947","author":"Yu Youngjae","year":"2016"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/2393347.2396493"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISM.2017.50"},{"key":"e_1_2_1_40_1","volume-title":"Category-based deep CCA for fine-grained venue discovery from multimodal data","author":"Yu Yi","year":"2018"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2013.2269313"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-63579-8_14"}],
"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3281746","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3281746","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T01:02:10Z","timestamp":1750208530000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3281746"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,2,13]]},"references-count":42,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2019,2,28]]}},"alternative-id":["10.1145\/3281746"],"URL":"https:\/\/doi.org\/10.1145\/3281746","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,2,13]]},
"assertion":[{"value":"2018-03-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-09-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-02-13","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}