{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,31]],"date-time":"2026-03-31T02:27:29Z","timestamp":1774924049280,"version":"3.50.1"},"reference-count":27,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2021,2,3]],"date-time":"2021-02-03T00:00:00Z","timestamp":1612310400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,2,3]],"date-time":"2021-02-03T00:00:00Z","timestamp":1612310400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Hopelab Small Grant"},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["1R01HD069498"],"award-info":[{"award-number":["1R01HD069498"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100011569","name":"Mind and Life Institute","doi-asserted-by":"publisher","award":["2015-1440-Polsinelli"],"award-info":[{"award-number":["2015-1440-Polsinelli"]}],"id":[{"id":"10.13039\/100011569","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J AUDIO SPEECH MUSIC PROC."],"published-print":{"date-parts":[[2021,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Over the recent years, machine learning techniques have been employed to produce state-of-the-art results in several audio related tasks. The success of these approaches has been largely due to access to large amounts of open-source datasets and enhancement of computational resources. However, a shortcoming of these methods is that they often fail to generalize well to tasks from real life scenarios, due to domain mismatch. One such task is foreground speech detection from wearable audio devices. 
Several interfering factors such as dynamically varying environmental conditions, including background speakers, TV, or radio audio, render foreground speech detection to be a challenging task. Moreover, obtaining precise moment-to-moment annotations of audio streams for analysis and model training is also time-consuming and costly. In this work, we use multiple instance learning (MIL) to facilitate development of such models using annotations available at a lower time-resolution (coarsely labeled). We show how MIL can be applied to localize foreground speech in coarsely labeled audio and show both bag-level and instance-level results. We also study different pooling methods and how they can be adapted to densely distributed events as observed in our application. Finally, we show improvements using speech activity detection embeddings as features for foreground detection.<\/jats:p>","DOI":"10.1186\/s13636-020-00194-0","type":"journal-article","created":{"date-parts":[[2021,2,3]],"date-time":"2021-02-03T17:03:13Z","timestamp":1612371793000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":13,"title":["Deep multiple instance learning for foreground speech localization in ambient audio from wearable devices"],"prefix":"10.1186","volume":"2021","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0904-0573","authenticated-orcid":false,"given":"Rajat","family":"Hebbar","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Pavlos","family":"Papadopoulos","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ramon","family":"Reyes","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Alexander F.","family":"Danvers","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Angelina 
J.","family":"Polsinelli","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Suzanne A.","family":"Moseley","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"David A.","family":"Sbarra","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Matthias R.","family":"Mehl","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shrikanth","family":"Narayanan","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2021,2,3]]},"reference":[{"key":"194_CR1","doi-asserted-by":"publisher","first-page":"1538","DOI":"10.1109\/TBME.2014.2309951","volume":"61","author":"Y. Zheng","year":"2014","unstructured":"Y. Zheng, X. Ding, C. Poon, B. Lo, H. Zhang, X. Zhou, G. -Z. Yang, N. Zhao, Y. -T. Zhang, Unobtrusive sensing and wearable devices for health informatics. IEEE Trans. Biomed. Eng.61:, 1538\u20131554 (2014). https:\/\/doi.org\/10.1109\/TBME.2014.2309951.","journal-title":"IEEE Trans. Biomed. Eng."},{"issue":"6","key":"194_CR2","doi-asserted-by":"publisher","first-page":"3119","DOI":"10.1109\/JSEN.2014.2357257","volume":"15","author":"M. M. Rodgers","year":"2015","unstructured":"M. M. Rodgers, V. M. Pai, R. S. Conroy, Recent advances in wearable sensors for health monitoring. IEEE Sensors J.15(6), 3119\u20133126 (2015). https:\/\/doi.org\/10.1109\/JSEN.2014.2357257.","journal-title":"IEEE Sensors J."},{"issue":"1","key":"194_CR3","doi-asserted-by":"publisher","first-page":"204","DOI":"10.1037\/pspp0000245","volume":"119","author":"G. M. Harari","year":"2019","unstructured":"G. M. Harari, S. R. M\u00fcller, C. Stachl, R. Wang, W. Wang, M. B\u00fchner, P. J. Rentfrow, A. T. Campbell, S. D. 
Gosling, Sensing sociability: individual differences in young adults\u2019 conversation, calling, texting, and app use behaviors in daily life. J. Pers. Soc. Psychol.119(1), 204\u2013228 (2019).","journal-title":"J. Pers. Soc. Psychol."},{"issue":"9","key":"194_CR4","doi-asserted-by":"publisher","first-page":"1451","DOI":"10.1177\/0956797618774252","volume":"29","author":"A. Milek","year":"2018","unstructured":"A. Milek, E. A. Butler, A. M. Tackman, D. M. Kaplan, C. L. Raison, D. A. Sbarra, S. Vazire, M. R. Mehl, \u201cEavesdropping on happiness\u201d revisited: a pooled, multisample replication of the association between life satisfaction and observed daily conversation quantity and quality. Psychol. Sci.29(9), 1451\u20131462 (2018).","journal-title":"Psychol. Sci."},{"issue":"6","key":"194_CR5","doi-asserted-by":"publisher","first-page":"1478","DOI":"10.1037\/pspp0000272","volume":"119","author":"J. Sun","year":"2019","unstructured":"J. Sun, K. Harris, S. Vazire, Is well-being associated with the quantity and quality of social interactions?. J. Pers. Soc. Psychol.119(6), 1478\u20131496 (2019).","journal-title":"J. Pers. Soc. Psychol."},{"issue":"1","key":"194_CR6","doi-asserted-by":"publisher","first-page":"30","DOI":"10.1016\/j.bandc.2004.05.003","volume":"56","author":"M. Cannizzaro","year":"2004","unstructured":"M. Cannizzaro, B. Harel, N. Reilly, P. Chappell, P. J. Snyder, Voice acoustical measurement of the severity of major depression. Brain Cogn.56(1), 30\u201335 (2004).","journal-title":"Brain Cogn."},{"issue":"2","key":"194_CR7","doi-asserted-by":"publisher","first-page":"142","DOI":"10.1109\/T-AFFC.2012.38","volume":"4","author":"Y. Yang","year":"2012","unstructured":"Y. Yang, C. Fairbairn, J. F. Cohn, Detecting depression severity from vocal prosody. IEEE Trans. Affect. Comput.4(2), 142\u2013150 (2012).","journal-title":"IEEE Trans. Affect. 
Comput."},{"issue":"2","key":"194_CR8","doi-asserted-by":"publisher","first-page":"184","DOI":"10.1177\/0963721416680611","volume":"26","author":"M. R. Mehl","year":"2017","unstructured":"M. R. Mehl, The electronically activated recorder (EAR) a method for the naturalistic observation of daily social behavior. Curr. Dir. Psychol. Sci.26(2), 184\u2013190 (2017).","journal-title":"Curr. Dir. Psychol. Sci."},{"key":"194_CR9","doi-asserted-by":"crossref","unstructured":"T. Feng, A. Nadarajan, C. Vaz, B. Booth, S. Narayanan, in Proceedings of the 4th ACM Workshop on Wearable Systems and Applications. Tiles audio recorder: an unobtrusive wearable solution to track audio activity (ACM, 2018), pp. 33\u201338.","DOI":"10.1145\/3211960.3211975"},{"key":"194_CR10","volume-title":"In CSCW\u201902 Workshop: Ad Hoc Communications and Collaboration in Ubiquitous Computing Environments","author":"T. Choudhury","year":"2002","unstructured":"T. Choudhury, A. Pentland, in In CSCW\u201902 Workshop: Ad Hoc Communications and Collaboration in Ubiquitous Computing Environments. The sociometer: a wearable device for understanding human networks (Association for Computing Machinery (ACM)New York, 2002)."},{"key":"194_CR11","doi-asserted-by":"crossref","unstructured":"A. Nadarajan, K. Somandepalli, S. S. Narayanan, in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Speaker agnostic foreground speech detection from audio recordings in workplace settings from wearable recorders (IEEE, 2019), pp. 6765\u20136769.","DOI":"10.1109\/ICASSP.2019.8683244"},{"key":"194_CR12","doi-asserted-by":"crossref","unstructured":"J. Li, W. Dai, F. Metze, S. Qu, S. Das, in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). A comparison of deep learning methods for environmental sound detection (IEEE, 2017), pp. 
126\u2013130.","DOI":"10.1109\/ICASSP.2017.7952131"},{"issue":"1","key":"194_CR13","doi-asserted-by":"publisher","first-page":"189","DOI":"10.1109\/TPAMI.2016.2535231","volume":"39","author":"R. G. Cinbis","year":"2016","unstructured":"R. G. Cinbis, J. Verbeek, C. Schmid, Weakly supervised object localization with multi-fold multiple instance learning. IEEE Trans. Pattern. Anal. Mach. Intell.39(1), 189\u2013203 (2016).","journal-title":"IEEE Trans. Pattern. Anal. Mach. Intell."},{"key":"194_CR14","doi-asserted-by":"crossref","unstructured":"Y. Wang, J. Li, F. Metze, Comparing the max and noisy-or pooling functions in multiple instance learning for weakly supervised sequence learning tasks. arXiv preprint arXiv:1804.01146 (2018).","DOI":"10.21437\/Interspeech.2018-990"},{"key":"194_CR15","unstructured":"Q. Kong, Y. Cao, T. Iqbal, Y. Xu, W. Wang, M. D. Plumbley, Cross-task learning for audio tagging, sound event detection and spatial localization: Dcase 2019 baseline systems. arXiv preprint arXiv:1904.03476 (2019)."},{"key":"194_CR16","doi-asserted-by":"crossref","unstructured":"K. Deepak, B. D. Sarma, S. M. Prasanna, in Thirteenth Annual Conference of the International Speech Communication Association. Foreground speech segmentation using zero frequency filtered signal (International Speech Communication Association (ISCA), 2012).","DOI":"10.21437\/Interspeech.2012-427"},{"key":"194_CR17","doi-asserted-by":"crossref","unstructured":"C. Wang, W. Ren, K. Huang, T. Tan, in European Conference on Computer Vision. Weakly supervised object localization with latent category learning (Springer, 2014), pp. 431\u2013445.","DOI":"10.1007\/978-3-319-10599-4_28"},{"key":"194_CR18","unstructured":"M. Ilse, J. M. Tomczak, M. Welling, Attention-based deep multiple instance learning. arXiv preprint arXiv:1802.04712 (2018)."},{"key":"194_CR19","doi-asserted-by":"crossref","unstructured":"Q. Kong, Y. Xu, W. Wang, M. D. 
Plumbley, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Audio set classification with attention model: a probabilistic perspective (IEEE, 2018), pp. 316\u2013320.","DOI":"10.1109\/ICASSP.2018.8461392"},{"key":"194_CR20","doi-asserted-by":"crossref","unstructured":"A. Kumar, B. Raj, in 2016 IEEE International Conference on Multimedia and Expo (ICME). Weakly supervised scalable audio content analysis (IEEE, 2016), pp. 1\u20136.","DOI":"10.1109\/ICME.2016.7552989"},{"key":"194_CR21","doi-asserted-by":"crossref","unstructured":"S. -Y. Tseng, J. Li, Y. Wang, J. Szurley, F. Metze, S. Das, Multiple instance deep learning for weakly supervised small-footprint audio event detection. arXiv preprint arXiv:1712.09673 (2017).","DOI":"10.21437\/Interspeech.2018-1120"},{"key":"194_CR22","doi-asserted-by":"crossref","unstructured":"D. Wang, T. F. Zheng, in 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). Transfer learning for speech and language processing (IEEE, 2015), pp. 1225\u20131237.","DOI":"10.1109\/APSIPA.2015.7415532"},{"key":"194_CR23","doi-asserted-by":"crossref","unstructured":"J. Kunze, L. Kirsch, I. Kurenkov, A. Krug, J. Johannsmeier, S. Stober, Transfer learning for speech recognition on a budget. arXiv preprint arXiv:1706.00290 (2017).","DOI":"10.18653\/v1\/W17-2620"},{"key":"194_CR24","doi-asserted-by":"crossref","unstructured":"R. Hebbar, K. Somandepalli, S. Narayanan, in Proc. Interspeech 2018. Improving gender identification in movie audio using cross-domain data, (2018), pp. 282\u2013286. https:\/\/doi.org\/10.21437\/Interspeech.2018-1462. http:\/\/dx.doi.org\/10.21437\/Interspeech.2018-1462.","DOI":"10.21437\/Interspeech.2018-1462"},{"key":"194_CR25","doi-asserted-by":"crossref","unstructured":"A. J. Polsinelli, S. A. Moseley, M. D. Grilli, E. L. Glisky, M. R. 
Mehl, Natural, everyday language use provides a window into the integrity of older adults\u2019 executive functioning. J. Gerontol. B. 75(9), e215\u2013e220.","DOI":"10.1093\/geronb\/gbaa055"},{"key":"194_CR26","doi-asserted-by":"publisher","unstructured":"K. O\u2019Hara, A. Grinberg, A. Tackman, M. Mehl, D. Sbarra, Preprint: contact and psychological adjustment following divorce\/separation. Clin. Psychol. Sci. (2019). https:\/\/doi.org\/10.31234\/osf.io\/axhnq.","DOI":"10.31234\/osf.io\/axhnq"},{"key":"194_CR27","doi-asserted-by":"crossref","unstructured":"R. Hebbar, K. Somandepalli, S. Narayanan, in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Robust speech activity detection in movie audio: data resources and experimental evaluation (IEEE, 2019), pp. 4105\u20134109.","DOI":"10.1109\/ICASSP.2019.8682532"}],"container-title":["EURASIP Journal on Audio, Speech, and Music Processing"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/s13636-020-00194-0.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/article\/10.1186\/s13636-020-00194-0\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/s13636-020-00194-0.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,14]],"date-time":"2022-12-14T15:56:24Z","timestamp":1671033384000},"score":1,"resource":{"primary":{"URL":"https:\/\/asmp-eurasipjournals.springeropen.com\/articles\/10.1186\/s13636-020-00194-0"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,2,3]]},"references-count":27,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2021,12]]}},"alternative-id":["194"],"URL":"https:\/\/doi.org\/10.1186\/
s13636-020-00194-0","relation":{},"ISSN":["1687-4722"],"issn-type":[{"value":"1687-4722","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,2,3]]},"assertion":[{"value":"5 August 2020","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"27 December 2020","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"3 February 2021","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The authors declare that they have no competing interests.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"7"}}