{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T04:38:09Z","timestamp":1750307889547,"version":"3.41.0"},"reference-count":32,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2010,8,1]],"date-time":"2010-08-01T00:00:00Z","timestamp":1280620800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000144","name":"Division of Computer and Network Systems","doi-asserted-by":"publisher","award":["CNS-07-16293CNS-07-51078"],"award-info":[{"award-number":["CNS-07-16293CNS-07-51078"]}],"id":[{"id":"10.13039\/100000144","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2010,8]]},"abstract":"<jats:p>We investigate the challenging issue of joint audio-visual analysis of generic videos targeting at concept detection. We extract a novel local representation, Audio-Visual Atom (AVA), which is defined as a region track associated with regional visual features and audio onset features. We develop a hierarchical algorithm to extract visual atoms from generic videos, and locate energy onsets from the corresponding soundtrack by time-frequency analysis. Audio atoms are extracted around energy onsets. Visual and audio atoms form AVAs, based on which discriminative audio-visual codebooks are constructed for concept detection. 
Experiments over Kodak's consumer benchmark videos confirm the effectiveness of our approach.<\/jats:p>","DOI":"10.1145\/1823746.1823748","type":"journal-article","created":{"date-parts":[[2010,8,31]],"date-time":"2010-08-31T13:05:55Z","timestamp":1283259955000},"page":"1-19","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":9,"title":["Audio-visual atoms for generic video concept classification"],"prefix":"10.1145","volume":"6","author":[{"given":"Wei","family":"Jiang","sequence":"first","affiliation":[{"name":"Columbia University, New York, NY"}]},{"given":"Courtenay","family":"Cotton","sequence":"additional","affiliation":[{"name":"Columbia University, New York, NY"}]},{"given":"Shih-Fu","family":"Chang","sequence":"additional","affiliation":[{"name":"Columbia University, New York, NY"}]},{"given":"Dan","family":"Ellis","sequence":"additional","affiliation":[{"name":"Columbia University, New York, NY"}]},{"given":"Alexander C.","family":"Loui","sequence":"additional","affiliation":[{"name":"Eastman Kodak Company, Rochester, NY"}]}],"member":"320","published-online":{"date-parts":[[2010,8,27]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Proceedings of the International Conference on Computational Science.","author":"Anemueller J.","year":"2008","unstructured":"Anemueller, J., Bach, J., Caputo, B., et al. 2008. Biologically motivated audio-visual cue integration for object categorization. In Proceedings of the International Conference on Computational Science."},{"volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.","author":"Barzelay Z.","key":"e_1_2_1_2_1","unstructured":"Barzelay, Z. and Schechner, Y. 2007. Harmony in motion. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2003.1206512"},{"key":"e_1_2_1_4_1","volume-title":"Kit: An implementation of the Kanade-Lucas-Tomasi feature tracker","author":"Birchfeld S.","year":"2007","unstructured":"Birchfeld, S. 2007. Kit: An implementation of the Kanade-Lucas-Tomasi feature tracker. http:\/\/vision.stanford.eduj\/~birch."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/1290082.1290118"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.5555\/1005332.1016789"},{"volume-title":"Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, 1--4.","author":"Chu S.","key":"e_1_2_1_7_1","unstructured":"Chu, S. and Narayanan, S. 2008. Environmental sound recognition using mp-based features. 
In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, 1--4."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2006.886263"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/957013.957124"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/34.946985"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1214\/aos\/1016218223"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/1282280.1282344"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2005.239"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1155\/2007\/64506"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2003.1233903"},{"volume-title":"Proceedings of the 2nd European Conference on Computer Vision. 376--387","author":"Kaucic R.","key":"e_1_2_1_17_1","unstructured":"Kaucic, R., Dalton, B., and Blake, A. 1996. Real-time lip tracking for audio-visual speech recognition applications. In Proceedings of the 2nd European Conference on Computer Vision. 376--387."},{"volume-title":"Proceedings of the International Conference on Acoustics, Speech, and Signal Processing. 496--499","author":"Krstuloyic S.","key":"e_1_2_1_18_1","unstructured":"Krstuloyic, S. and Grigonyal, R. 2006. MPTK Matching Pursuit made tractable. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing. 
496--499."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/1290082.1290117"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1023\/B:VISI.0000029664.99615.94"},{"volume-title":"Proceedings of the Imaging Understanding Workshop. 121--130","author":"Lucas B.","key":"e_1_2_1_21_1","unstructured":"Lucas, B. and Kanade, T. 1981. An iterative image registration technique with an application to stereo vision. In Proceedings of the Imaging Understanding Workshop. 121--130."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/78.258082"},{"volume-title":"Proceedings of the Conference on Advances in Neural Information Processing Systems. 570--576","author":"Maron O.","key":"e_1_2_1_23_1","unstructured":"Maron, O. and Lozano-Perez, T. 1998. A framework for multiple-instance learning. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 570--576."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-007-0122-4"},{"volume-title":"Proceedings of the International Conference on Acoustics, Speech, and Signal Processing. 233--236","author":"Ogle J.","key":"e_1_2_1_25_1","unstructured":"Ogle, J. and Ellis, D. 2007. Fingerprinting to identify repeated sound events in long-duration personal audio recordings. 
In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing. 233--236."},{"key":"e_1_2_1_26_1","unstructured":"Petitcolas F. 2003. Mpeg for matlab. http:\/\/www.petitcolas.net\/fabien\/software.mpeg."},{"volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 593--600","author":"Shi J.","key":"e_1_2_1_27_1","unstructured":"Shi, J. and Tomasi, C. 1994. Good features to track. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 593--600."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/1178677.1178722"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/34.868677"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1023\/B:VISI.0000004830.93820.78"},{"key":"e_1_2_1_31_1","unstructured":"Yanagawa A. Hsu W. and Chang S. 2006. Brief descriptions of visual features for baseline TRECVID concept detectors. Columbia University ADVENT Tech. rep. 
219-2006-5."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2006.250"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2008.08.006"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1823746.1823748","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/1823746.1823748","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T14:47:16Z","timestamp":1750258036000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1823746.1823748"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2010,8]]},"references-count":32,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2010,8]]}},"alternative-id":["10.1145\/1823746.1823748"],"URL":"https:\/\/doi.org\/10.1145\/1823746.1823748","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2010,8]]},"assertion":[{"value":"2010-03-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2010-05-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2010-08-27","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}