{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2022,4,2]],"date-time":"2022-04-02T13:50:36Z","timestamp":1648907436492},"reference-count":30,"publisher":"Institute of Electronics, Information and Communications Engineers (IEICE)","issue":"10","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["IEICE Trans. Inf. &amp; Syst."],"published-print":{"date-parts":[[2015]]},"DOI":"10.1587\/transinf.2014edp7430","type":"journal-article","created":{"date-parts":[[2015,9,30]],"date-time":"2015-09-30T18:07:40Z","timestamp":1443636460000},"page":"1799-1807","source":"Crossref","is-referenced-by-count":4,"title":["Acoustic Event Detection in Speech Overlapping Scenarios Based on High-Resolution Spectral Input and Deep Learning"],"prefix":"10.1587","volume":"E98.D","author":[{"given":"Miquel","family":"ESPI","sequence":"first","affiliation":[{"name":"NTT Communication Science Laboratories, NTT Corporation"}]},{"given":"Masakiyo","family":"FUJIMOTO","sequence":"additional","affiliation":[{"name":"NTT Communication Science Laboratories, NTT Corporation"}]},{"given":"Tomohiro","family":"NAKATANI","sequence":"additional","affiliation":[{"name":"NTT Communication Science Laboratories, NTT Corporation"}]}],"member":"532","reference":[{"key":"1","doi-asserted-by":"crossref","unstructured":"[1] D. Mostefa, N. Moreau, K. Choukri, G. Potamianos, S. Chu, A. Tyagi, J. Casas, J. Turmo, L. Cristoforetti, F. Tobia, A. Pnevmatikakis, V. Mylonakis, F. Talantzis, S. Burger, R. Stiefelhagen, K. Bernardin, and C. Rochet, \u201cThe CHIL audiovisual corpus for lecture and meeting analysis inside smart rooms,\u201d Language Resources and Evaluation, vol.41, no.3-4, pp.389-407, 2007.","DOI":"10.1007\/s10579-007-9054-4"},{"key":"2","doi-asserted-by":"crossref","unstructured":"[2] D. Giannoulis, E. Benetos, D. Stowell, M. Rossignol, M. Lagrange, and M. 
Plumbley, \u201cDetection and classification of acoustic scenes and events: An IEEE AASP challenge,\u201d 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp.1-4, 2013.","DOI":"10.1109\/WASPAA.2013.6701819"},{"key":"3","unstructured":"[3] K. Imoto, S. Shimauchi, H. Uematsu, and H. Ohmuro, \u201cUser activity estimation method based on probabilistic generative model of acoustic event sequence with user activity and its subordinate categories,\u201d INTERSPEECH&apos;2013, pp.2609-2613, 2013."},{"key":"4","doi-asserted-by":"crossref","unstructured":"[4] C. Canton-Ferrer, T. Butko, C. Segura, X. Giro, C. Nadeu, J. Hernando, and J. Casas, \u201cAudiovisual event detection towards scene understanding,\u201d IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR), pp.81-88, 2009.","DOI":"10.1109\/CVPR.2009.5204264"},{"key":"5","doi-asserted-by":"crossref","unstructured":"[5] T. Hori, S. Araki, T. Yoshioka, M. Fujimoto, S. Watanabe, T. Oba, A. Ogawa, K. Otsuka, D. Mikami, K. Kinoshita, T. Nakatani, A. Nakamura, and J. Yamato, \u201cLow-latency real-time meeting recognition and understanding using distant microphones and omni-directional camera,\u201d IEEE Trans. Audio, Speech, Language Process., vol.20, no.2, pp.499-513, 2012.","DOI":"10.1109\/TASL.2011.2164527"},{"key":"6","doi-asserted-by":"crossref","unstructured":"[6] A. Ozerov, A. Liutkus, R. Badeau, and G. Richard, \u201cInformed source separation: source coding meets source separation,\u201d 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp.257-260, IEEE, 2011.","DOI":"10.1109\/ASPAA.2011.6082285"},{"key":"7","unstructured":"[7] A. Harma, M. McKinney, and J. Skowronek, \u201cAutomatic surveillance of the acoustic activity in our living environment,\u201d IEEE International Conference on Multimedia and Expo (ICME), pp.634-637, 2005."},{"key":"8","doi-asserted-by":"crossref","unstructured":"[8] R. 
Cai, L. Lu, A. Hanjalic, H.-J. Zhang, and L.-H. Cai, \u201cA flexible framework for key audio effects detection and auditory context inference,\u201d IEEE Transactions on Audio, Speech and Language Processing, vol.14, no.3, pp.1026-1039, 2006.","DOI":"10.1109\/TSA.2005.857575"},{"key":"9","doi-asserted-by":"crossref","unstructured":"[9] M. Xu, C. Xu, L. Duan, J. Jin, and S. Luo, \u201cAudio keywords generation for sports video analysis,\u201d ACM Transactions on Multimedia Computing, Communications, and Applications, vol.4, no.2, pp.1-23, 2008.","DOI":"10.1145\/1352012.1352015"},{"key":"10","unstructured":"[10] Y.-T. Peng, C.-Y. Lin, M.-T. Sun, and K.-C. Tsai, \u201cHealthcare audio event classification using hidden Markov models and hierarchical hidden Markov models,\u201d IEEE International Conference on Multimedia and Expo (ICME), pp.1218-1221, 2009."},{"key":"11","doi-asserted-by":"crossref","unstructured":"[11] M. Shah, B. Mears, C. Chakrabarti, and A. Spanias, \u201cLifelogging: Archival and retrieval of continuously recorded audio using wearable devices,\u201d Emerging Signal Processing Applications, pp.99-102, 2012.","DOI":"10.1109\/ESPA.2012.6152455"},{"key":"12","doi-asserted-by":"crossref","unstructured":"[12] G. Wichern, J. Xue, H. Thornburg, B. Mechtley, and A. Spanias, \u201cSegmentation, indexing, and retrieval for environmental and natural sounds,\u201d IEEE Trans. Audio, Speech, Language Process., vol.18, no.3, pp.688-707, 2010.","DOI":"10.1109\/TASL.2010.2041384"},{"key":"13","unstructured":"[13] X. Zhuang, X. Zhou, M. Hasegawa-Johnson, and T. Huang, \u201cReal-world acoustic event detection,\u201d Pattern Recognition Letters, vol.31, no.12, pp.1543-1551, 2010."},{"key":"14","unstructured":"[14] M. Espi, M. Fujimoto, D. Saito, N. Ono, and S. 
Sagayama, \u201cA tandem connectionist model using combination of multi-scale spectro-temporal features for acoustic event detection,\u201d 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.4293-4296, 2012."},{"key":"15","doi-asserted-by":"crossref","unstructured":"[15] S. Araki, T. Nakatani, and H. Sawada, \u201cSimultaneous clustering of mixing and spectral model parameters for blind sparse source separation,\u201d 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), pp.5-8, 2010.","DOI":"10.1109\/ICASSP.2010.5496283"},{"key":"16","doi-asserted-by":"crossref","unstructured":"[16] T. Nakatani and S. Araki, \u201cSingle channel source separation based on sparse source observation model with harmonic constraint,\u201d 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), pp.13-16, 2010.","DOI":"10.1109\/ICASSP.2010.5496273"},{"key":"17","doi-asserted-by":"crossref","unstructured":"[17] A.-R. Mohamed, G. Dahl, and G. Hinton, \u201cAcoustic modeling using deep belief networks,\u201d IEEE Trans. Audio, Speech, Language Process., vol.20, no.1, pp.14-22, 2012.","DOI":"10.1109\/TASL.2011.2109382"},{"key":"18","unstructured":"[18] Z. Kons and O. Toledo-Ronen, \u201cAudio event classification using deep neural networks,\u201d INTERSPEECH&apos;2013, pp.1482-1486, 2013."},{"key":"19","unstructured":"[19] G. Hinton, \u201cA practical guide to training restricted Boltzmann machines,\u201d Technical report 2010-003, Machine Learning Group-University of Toronto, 2010."},{"key":"20","doi-asserted-by":"crossref","unstructured":"[20] M. Espi, M. Fujimoto, Y. Kubo, and T. Nakatani, \u201cSpectrogram patch based acoustic event detection and classification in speech overlapping conditions,\u201d 2014 4th Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA), pp.117-121, May 2014.","DOI":"10.1109\/HSCMA.2014.6843263"},{"key":"21","unstructured":"[21] Y. 
Ohishi, D. Mochihashi, T. Matsui, M. Nakano, H. Kameoka, T. Izumitani, and K. Kashino, \u201cBayesian semi-supervised audio event transcription based on Markov Indian buffet process,\u201d 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.3163-3167, 2013."},{"key":"22","unstructured":"[22] H. Hermansky, D.P.W. Ellis, and S. Sharma, \u201cTandem connectionist feature extraction for conventional HMM systems,\u201d 2000 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.1635-1638, 2000."},{"key":"23","doi-asserted-by":"crossref","unstructured":"[23] T. Heittola, A. Mesaros, A. Eronen, and T. Virtanen, \u201cContext-dependent sound event detection,\u201d EURASIP Journal on Audio, Speech, and Music Processing, vol.2013, no.1, pp.1-13, 2013.","DOI":"10.1186\/1687-4722-2013-1"},{"key":"24","unstructured":"[24] L. Deng, M.L. Seltzer, D. Yu, A. Acero, A. Mohamed, and G. Hinton, \u201cBinary coding of speech spectrograms using a deep auto-encoder,\u201d INTERSPEECH&apos;2010, pp.1692-1695, 2010."},{"key":"25","doi-asserted-by":"crossref","unstructured":"[25] Y. Lu and P.C. Loizou, \u201cEstimators of the magnitude-squared spectrum and methods for incorporating SNR uncertainty,\u201d IEEE Trans. Audio, Speech, Language Process., vol.19, no.5, pp.1123-1137, 2011.","DOI":"10.1109\/TASL.2010.2082531"},{"key":"26","unstructured":"[26] A. Papoulis and S. Pillai, Probability, Random Variables and Stochastic Processes with Errata Sheet, McGraw-Hill Education, New York, NY, 2002."},{"key":"27","doi-asserted-by":"crossref","unstructured":"[27] R. Stiefelhagen, K. Bernardin, R. Bowers, R.T. Rose, M. Michel, and J. Garofolo, \u201cThe CLEAR 2007 evaluation,\u201d in Multimodal Technologies for Perception of Humans, pp.3-34, Springer, 2008.","DOI":"10.1007\/978-3-540-68585-2_1"},{"key":"28","unstructured":"[28] M. Delcroix, T. Yoshioka, A. Ogawa, Y. Kubo, M. Fujimoto, N. Ito, K. Kinoshita, M. Espi, T. Hori, T. 
Nakatani, and A. Nakamura, \u201cLinear prediction-based dereverberation with advanced speech enhancement and recognition technologies for the reverb challenge,\u201d Proc. Reverb Challenge 2014, 2014."},{"key":"29","unstructured":"[29] H. Hirsch and D. Pearce, \u201cAURORA-4.\u201d http:\/\/aurora.hsnr.de\/aurora-4.html"},{"key":"30","unstructured":"[30] S. Thomas, S. Ganapathy, G. Saon, and H. Soltau, \u201cAnalyzing convolutional neural networks for speech activity detection in mismatched acoustic conditions,\u201d 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.2519-2523, May 2014."}],"container-title":["IEICE Transactions on Information and Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.jstage.jst.go.jp\/article\/transinf\/E98.D\/10\/E98.D_2014EDP7430\/_pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2019,8,31]],"date-time":"2019-08-31T00:23:55Z","timestamp":1567211035000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.jstage.jst.go.jp\/article\/transinf\/E98.D\/10\/E98.D_2014EDP7430\/_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2015]]},"references-count":30,"journal-issue":{"issue":"10","published-print":{"date-parts":[[2015]]}},"URL":"https:\/\/doi.org\/10.1587\/transinf.2014edp7430","relation":{},"ISSN":["0916-8532","1745-1361"],"issn-type":[{"value":"0916-8532","type":"print"},{"value":"1745-1361","type":"electronic"}],"subject":[],"published":{"date-parts":[[2015]]}}}