{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,3]],"date-time":"2026-04-03T15:25:16Z","timestamp":1775229916525,"version":"3.50.1"},"reference-count":51,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2018,7,30]],"date-time":"2018-07-30T00:00:00Z","timestamp":1532908800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Graph."],"published-print":{"date-parts":[[2018,8,31]]},"abstract":"<jats:p>\n            We present a joint audio-visual model for isolating a single speech signal from a mixture of sounds such as other speakers and background noise. Solving this task using only audio as input is extremely challenging and does not provide an association of the separated speech signals with speakers in the video. In this paper, we present a deep network-based model that incorporates both visual and auditory signals to solve this task. The visual features are used to \"focus\" the audio on desired speakers in a scene and to improve the speech separation quality. To train our joint audio-visual model, we introduce AVS\n            <jats:sc>peech<\/jats:sc>\n            , a new dataset comprised of thousands of hours of video segments from the Web. We demonstrate the applicability of our method to classic speech separation tasks, as well as real-world scenarios involving heated interviews, noisy bars, and screaming children, only requiring the user to specify the face of the person in the video whose speech they want to isolate. Our method shows clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech. 
In addition, our model, which is speaker-independent (trained once, applicable to any speaker), produces better results than recent audio-visual speech separation methods that are speaker-dependent (require training a separate model for each speaker of interest).\n          <\/jats:p>","DOI":"10.1145\/3197517.3201357","type":"journal-article","created":{"date-parts":[[2018,7,31]],"date-time":"2018-07-31T15:56:23Z","timestamp":1533052583000},"page":"1-11","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":553,"title":["Looking to listen at the cocktail party"],"prefix":"10.1145","volume":"37","author":[{"given":"Ariel","family":"Ephrat","sequence":"first","affiliation":[{"name":"Google Research and The Hebrew University of Jerusalem, Israel"}]},{"given":"Inbar","family":"Mosseri","sequence":"additional","affiliation":[{"name":"Google Research"}]},{"given":"Oran","family":"Lang","sequence":"additional","affiliation":[{"name":"Google Research"}]},{"given":"Tali","family":"Dekel","sequence":"additional","affiliation":[{"name":"Google Research"}]},{"given":"Kevin","family":"Wilson","sequence":"additional","affiliation":[{"name":"Google Research"}]},{"given":"Avinatan","family":"Hassidim","sequence":"additional","affiliation":[{"name":"Google Research"}]},{"given":"William T.","family":"Freeman","sequence":"additional","affiliation":[{"name":"Google Research"}]},{"given":"Michael","family":"Rubinstein","sequence":"additional","affiliation":[{"name":"Google Research"}]}],"member":"320","published-online":{"date-parts":[[2018,7,30]]},"reference":[{"key":"e_1_2_2_1_1","volume-title":"The Conversation: Deep Audio-Visual Speech Enhancement. In arXiv:1804.04121.","author":"Afouras T.","year":"2018"},{"key":"e_1_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2010.2050650"},{"key":"e_1_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1121\/1.1907229"},{"key":"e_1_2_2_4_1","volume-title":"Lip Reading Sentences in the Wild. CoRR abs\/1611.05358","author":"Chung Joon Son","year":"2016"},{"key":"e_1_2_2_5_1","volume-title":"CVPR'17","author":"Cole Forrester","year":"2016"},{"key":"e_1_2_2_6_1","volume-title":"Handbook of Blind Source Separation: Independent component analysis and applications","author":"Comon Pierre"},{"key":"e_1_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2017.2687829"},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW.2017.61"},{"key":"e_1_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2015.7178061"},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/IJCNN.2017.7965918"},{"key":"e_1_2_2_11_1","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Gabbay Aviv","year":"2018"},{"key":"e_1_2_2_12_1","volume-title":"Visual Speech Enhancement using Noise-Invariant Training. arXiv preprint arXiv:1711.08789","author":"Gabbay Aviv","year":"2017"},{"key":"e_1_2_2_13_1","doi-asserted-by":"crossref","unstructured":"R Gao R Feris and K. Grauman. 2018. Learning to Separate Object Sounds by Watching Unlabeled Video. arXiv preprint arXiv:1804.01665 (2018).  R Gao R Feris and K. Grauman. 2018. Learning to Separate Object Sounds by Watching Unlabeled Video. 
,"DOI":"10.1007\/978-3-030-01219-9_3"},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2017.7952261"},{"key":"e_1_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1523\/JNEUROSCI.3675-12.2013"},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2015.2407694"},{"key":"e_1_2_2_17_1","volume-title":"Glass","author":"Harwath David F.","year":"2016"},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2004.1327194"},{"key":"e_1_2_2_19_1","unstructured":"John R. Hershey and Michael Casey. 2002. Audio-visual sound separation via hidden Markov models. In Advances in Neural Information Processing Systems. 1173--1180."},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2016.7471631"},{"key":"e_1_2_2_21_1","volume-title":"ViSQOLAudio: An objective audio quality metric for low bitrate codecs. The Journal of the Acoustical Society of America 137 6","author":"Hines Andrew","year":"2015"},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.specom.2011.09.004"},{"key":"e_1_2_2_23_1","volume-title":"Putting a Face to the Voice: Fusing Audio and Visual Signals Across a Video to Determine Speakers. CoRR abs\/1706.00079","author":"Hoover Ken","year":"2017"},{"key":"e_1_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/TETCI.2017.2784878"},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/2733373.2806293"},{"key":"e_1_2_2_26_1","volume-title":"Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML.","author":"Ioffe Sergey","year":"2015"},{"key":"e_1_2_2_27_1","volume-title":"Zhuo Chen, Shinji Watanabe, and John R Hershey.","author":"Isik Yusuf","year":"2016"},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0004638"},{"key":"e_1_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cub.2009.09.005"},{"key":"e_1_2_2_31_1","volume-title":"Signal Processing Conference","author":"Monaci Gianluca","year":"2011"},{"key":"e_1_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2015.7178347"},{"key":"e_1_2_2_33_1","volume-title":"Ng","author":"Ngiam Jiquan","year":"2011"},{"key":"e_1_2_2_34_1","doi-asserted-by":"crossref","unstructured":"Andrew Owens and Alexei A. Efros. 2018. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features. (2018).","DOI":"10.1007\/978-3-030-01231-1_39"},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1155\/S1110865702206101"},{"key":"e_1_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2017.7952687"},{"key":"e_1_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/MSP.2013.2296173"},{"key":"e_1_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2001.941023"},{"key":"e_1_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46454-1_2"},{"key":"e_1_2_2_40_1","volume-title":"TIMIT Acoustic-phonetic Continuous Speech Corpus. (11","author":"Garofolo J S","year":"1992"},{"key":"e_1_2_2_41_1","doi-asserted-by":"crossref","unstructured":"Lei Sun, Jun Du, Li-Rong Dai, and Chin-Hui Lee. 2017. Multiple-target deep learning for LSTM-RNN based speech enhancement. In HSCMA.","DOI":"10.1109\/HSCMA.2017.7895577"},{"key":"e_1_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2010.5495701"},{"key":"e_1_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2013.6637622"},{"key":"e_1_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSA.2005.858005"},{"key":"e_1_2_2_45_1","volume-title":"Supervised Speech Separation Based on Deep Learning: An Overview. CoRR abs\/1708.07524","author":"Wang DeLiang","year":"2017"},{"key":"e_1_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2014.2352935"},{"key":"e_1_2_2_47_1","doi-asserted-by":"crossref","unstructured":"Ziteng Wang, Xiaofei Wang, Xu Li, Qiang Fu, and Yonghong Yan. 2016. Oracle performance investigation of the ideal masks. In IWAENC.","DOI":"10.1109\/IWAENC.2016.7602888"},{"key":"e_1_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-22482-4_11"},{"key":"e_1_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2017.7952154"},{"key":"e_1_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10590-1_53"},{"key":"e_1_2_2_51_1","doi-asserted-by":"crossref","unstructured":"Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, and Antonio Torralba. 2018. The Sound of Pixels. (2018).","DOI":"10.1007\/978-3-030-01246-5_35"},{"key":"e_1_2_2_52_1","volume-title":"Object detectors emerge in deep scene cnns. arXiv preprint arXiv:1412.6856","author":"Zhou Bolei","year":"2014"}],"container-title":["ACM Transactions on Graphics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3197517.3201357","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3197517.3201357","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T01:39:44Z","timestamp":1750210784000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3197517.3201357"}},"subtitle":["a speaker-independent audio-visual model for speech separation"],"short-title":[],"issued":{"date-parts":[[2018,7,30]]},"references-count":51,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2018,8,31]]}},"alternative-id":["10.1145\/3197517.3201357"],"URL":"https:\/\/doi.org\/10.1145\/3197517.3201357","relation":{},"ISSN":["0730-0301","1557-7368"],"issn-type":[{"value":"0730-0301","type":"print"},{"value":"1557-7368","type":"electronic"}],"subject":[],"published":{"date-parts":[[2018,7,30]]},"assertion":[{"value":"2018-07-30","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
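
The record above is a standard Crossref REST API work envelope ({"status":"ok","message-type":"work",...,"message":{...}}). As a minimal sketch, assuming only network access to the public api.crossref.org endpoint, this is one way such a record can be fetched and the fields shown here read back; the User-Agent contact address is a hypothetical placeholder, included per Crossref's politeness guidelines.

import json
import urllib.request

DOI = "10.1145/3197517.3201357"
req = urllib.request.Request(
    "https://api.crossref.org/works/" + DOI,
    # Crossref asks polite callers to identify themselves; mailto is a placeholder.
    headers={"User-Agent": "metadata-check/0.1 (mailto:you@example.org)"},
)
with urllib.request.urlopen(req) as resp:
    envelope = json.load(resp)          # envelope["status"] should be "ok"

work = envelope["message"]              # the work payload shown above
print(work["title"][0])                 # Looking to listen at the cocktail party
print(work["subtitle"][0])              # a speaker-independent audio-visual model ...
print(work["DOI"], work["volume"], work["issue"])
print(work["is-referenced-by-count"])   # citation count at the time of the request
for author in work["author"]:           # eight authors, given/family name fields
    print(author["given"], author["family"])

Note that fields such as "is-referenced-by-count" and "indexed" change between requests, so a fresh fetch will not byte-match the snapshot above.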