{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,27]],"date-time":"2025-10-27T16:13:31Z","timestamp":1761581611656,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":48,"publisher":"ACM","license":[{"start":{"date-parts":[[2017,10,19]],"date-time":"2017-10-19T00:00:00Z","timestamp":1508371200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"1000 plan","award":["11150087963001"],"award-info":[{"award-number":["11150087963001"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2017,10,19]]},"DOI":"10.1145\/3123266.3123313","type":"proceedings-article","created":{"date-parts":[[2017,10,20]],"date-time":"2017-10-20T13:04:26Z","timestamp":1508504666000},"page":"1192-1200","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":77,"title":["Enhancing Micro-video Understanding by Harnessing External Sounds"],"prefix":"10.1145","author":[{"given":"Liqiang","family":"Nie","sequence":"first","affiliation":[{"name":"ShanDong University, Jinan, China"}]},{"given":"Xiang","family":"Wang","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}]},{"given":"Jianglong","family":"Zhang","sequence":"additional","affiliation":[{"name":"Communication University of China, Beijing, China"}]},{"given":"Xiangnan","family":"He","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}]},{"given":"Hanwang","family":"Zhang","sequence":"additional","affiliation":[{"name":"Columbia University, New York, NY, USA"}]},{"given":"Richang","family":"Hong","sequence":"additional","affiliation":[{"name":"Hefei University of Technology, Hefei, 
China"}]},{"given":"Qi","family":"Tian","sequence":"additional","affiliation":[{"name":"University of Texas at San Antonio, San Antonio, TX, USA"}]}],"member":"320","published-online":{"date-parts":[[2017,10,19]]},"reference":[{"key":"e_1_3_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/2671188.2749396"},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2015.2496275"},{"key":"e_1_3_2_1_3_1","volume-title":"Noisemes: Manual annotation of environmental noise in audio streams. Technical report Carnegie Mellon University-LTI-12-07","author":"Burger Susanne","year":"2012","unstructured":"Susanne Burger, Qin Jin, Peter F Schulam, and Florian Metze. 2012. Noisemes: Manual annotation of environmental noise in audio streams. Technical report Carnegie Mellon University-LTI-12-07 (2012), 1--5."},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2013.96"},{"key":"e_1_3_2_1_5_1","unstructured":"Diego Castan and Murat Akbacak. 2013. Segmental-GMM Approach based on Acoustic Concept Segmentation SLAM@ INTERSPEECH. 15--19."},{"key":"e_1_3_2_1_6_1","unstructured":"Sourish Chaudhuri and Bhiksha Raj. 2012. Unsupervised structure discovery for semantic analysis of audio NIPS. 1178--1186."},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/2964284.2964314"},{"key":"e_1_3_2_1_8_1","unstructured":"Ning Chen Jun Zhu and Eric P Xing. 2010. Predictive subspace learning for multi-view data: a large margin approach NIPS. 361--369."},
{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICME.2012.141"},{"key":"e_1_3_2_1_10_1","unstructured":"Jeff Donahue Yangqing Jia Oriol Vinyals Judy Hoffman Ning Zhang Eric Tzeng and Trevor Darrell. 2014. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition ICML. 647--655."},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2006.881969"},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/3077136.3080773"},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2072609.2072619"},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/2487575.2487644"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"crossref","unstructured":"James Hays and Alexei A Efros. 2008. IM2GPS: estimating geographic information from a single image CVPR. 1--8.","DOI":"10.1109\/CVPR.2008.4587784"},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3038912.3052569"},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/2911451.2911489"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"crossref","unstructured":"Adam Kilgarriff and Christiane Fellbaum. 2000. WordNet: An Electronic Lexical Database. (2000).","DOI":"10.2307\/417141"},{"key":"e_1_3_2_1_19_1","volume-title":"Hinton","author":"Krizhevsky Alex","year":"2012","unstructured":"Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks NIPS. 1106--1114."},
{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2016.2540802"},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2016.2537337"},{"key":"e_1_3_2_1_22_1","unstructured":"Gaowen Liu Yan Yan Elisa Ricci Yi Yang Yahong Han Stefan Winkler and Nicu Sebe. 2015. Inferring Painting Style with Multi-task Dictionary Learning IJCAI. 2162--2168."},{"key":"e_1_3_2_1_23_1","volume-title":"Jordan","author":"Long Mingsheng","year":"2015","unstructured":"Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I. Jordan. 2015. Learning Transferable Features with Deep Adaptation Networks ICML. 97--105."},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2011.156"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/1553374.1553463"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2011.156"},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2007.911828"},{"key":"e_1_3_2_1_28_1","volume-title":"Bach","author":"Mairal Julien","year":"2009","unstructured":"Julien Mairal, Jean Ponce, Guillermo Sapiro, Andrew Zisserman, and Francis R. Bach. 2009. Supervised Dictionary Learning. NIPS. 1033--1040."},
{"key":"e_1_3_2_1_29_1","unstructured":"Annamaria Mesaros Toni Heittola Antti Eronen and Tuomas Virtanen. 2010 a. Acoustic event detection in real life recordings. EUSIPCO. 1267--1271."},{"key":"e_1_3_2_1_30_1","unstructured":"Annamaria Mesaros Toni Heittola Antti J. Eronen and Tuomas Virtanen. 2010 b. Acoustic event detection in real life recordings. EUSIPCO. 1267--1271."},{"key":"e_1_3_2_1_31_1","unstructured":"Tomas Mikolov Ilya Sutskever Kai Chen Greg Corrado and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and Their Compositionality NIPS. 3111--3119."},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2007.901813"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/2390214.2390219"},{"key":"e_1_3_2_1_34_1","unstructured":"Mirco Ravanelli Benjamin Elizalde Karl Ni and Gerald Friedland. 2014. Audio concept classification with hierarchical deep neural networks EUSIPCO. 606--610."},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"crossref","unstructured":"S. Sadanand and J. J. Corso. 2012. Action bank: A high-level representation of activity in video CVPR. 1234--1241.","DOI":"10.1109\/CVPR.2012.6247806"},
{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/2766462.2767726"},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.5555\/1641661.1641671"},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2012.2207397"},{"key":"e_1_3_2_1_39_1","doi-asserted-by":"crossref","unstructured":"Xiang Wang Xiangnan He Liqiang Nie and Tat-Seng Chua. 2017. Item Silk Road: Recommending Items from Information Domains to Social Users. (2017).","DOI":"10.1145\/3077136.3080771"},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/3052774"},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"crossref","unstructured":"Yipei Wang Shourabh Rawat and Florian Metze. 2014. Exploring audio semantic concepts for event-based video retrieval ICASSP. 1360--1364.","DOI":"10.1109\/ICASSP.2014.6853819"},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"crossref","unstructured":"Meng Yang Weiyang Liu Weixin Luo and Linlin Shen. 2016 b. Analysis-Synthesis Dictionary Learning for Universality-Particularity Representation Based Classification. In AAAI. 2251--2257.","DOI":"10.1609\/aaai.v30i1.10219"},{"key":"e_1_3_2_1_43_1","unstructured":"Zhilin Yang William W. Cohen and Ruslan Salakhutdinov. 2016 a. Revisiting Semi-Supervised Learning with Graph Embeddings ICML. 40--48."},
{"key":"e_1_3_2_1_44_1","unstructured":"Jason Yosinski Jeff Clune Yoshua Bengio and Hod Lipson. 2014. How transferable are features in deep neural networks? NIPS. 3320--3328."},{"key":"e_1_3_2_1_45_1","doi-asserted-by":"crossref","unstructured":"Hanwang Zhang Zawlin Kyaw Shih-Fu Chang and Tat-Seng Chua. 2017. Visual Translation Embedding Network for Visual Relation Detection CVPR.","DOI":"10.1109\/CVPR.2017.331"},{"key":"e_1_3_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2014.2325784"},{"key":"e_1_3_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/2964284.2964307"},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"crossref","unstructured":"Yueting Zhuang Yanfei Wang Fei Wu Yin Zhang and Weiming Lu. 2013. Supervised Coupled Dictionary Learning with Group Structures for Multi-modal Retrieval AAAI. 1070--1076.","DOI":"10.1609\/aaai.v27i1.8603"}],
"event":{"name":"MM '17: ACM Multimedia Conference","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Mountain View California USA","acronym":"MM '17"},"container-title":["Proceedings of the 25th ACM international conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3123266.3123313","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3123266.3123313","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T03:39:28Z","timestamp":1750217968000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3123266.3123313"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2017,10,19]]},"references-count":48,"alternative-id":["10.1145\/3123266.3123313","10.1145\/3123266"],"URL":"https:\/\/doi.org\/10.1145\/3123266.3123313","relation":{},"subject":[],"published":{"date-parts":[[2017,10,19]]},"assertion":[{"value":"2017-10-19","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}