{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,4]],"date-time":"2026-05-04T10:51:46Z","timestamp":1777891906579,"version":"3.51.4"},"reference-count":109,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2021,3,8]],"date-time":"2021-03-08T00:00:00Z","timestamp":1615161600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Conicyt-Fondecyt","award":["1151306"],"award-info":[{"award-number":["1151306"]}]},{"DOI":"10.13039\/100007297","name":"ONRG","doi-asserted-by":"crossref","award":["62909-17-1-2002"],"award-info":[{"award-number":["62909-17-1-2002"]}],"id":[{"id":"10.13039\/100007297","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["J. Hum.-Robot Interact."],"published-print":{"date-parts":[[2021,6,30]]},"abstract":"<jats:p>This article presents a stand-alone automatic speech recognition system that accounts for listener movement, time-varying reverberation effects, environmental noise, and user position information for beamforming approaches in an HRI setting. We raise the importance of replacing the classical black-box integration of automatic speech recognition technology in HRI applications with the incorporation of the acoustic environment representation and modeling, and of the target source direction. Test data were recorded on a real robot under various moving conditions. For addressing the time-varying acoustic channel problem and incorporating environmental effect during training, clean speech samples were passed through estimated static channel responses and noise was added. Beamforming is investigated regarding oracle source tracking using, for instance, image processing. 
The proposed strategy is interesting for the robotics community, because it allows the development of voice-based HRI with limited training data and without relying on third-party technologies or Internet access eliminating the need to upload data to the cloud. In our mobile HRI scenario, the resulting speech recognition engine provided an average word error rate that is at least 19% and 34% lower than publicly available speech recognition APIs with the playback (i.e., loudspeaker) and human testing modalities, respectively.<\/jats:p>","DOI":"10.1145\/3442629","type":"journal-article","created":{"date-parts":[[2021,3,8]],"date-time":"2021-03-08T17:06:20Z","timestamp":1615223180000},"page":"1-30","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":14,"title":["Automatic Speech Recognition for Indoor HRI Scenarios"],"prefix":"10.1145","volume":"10","author":[{"given":"Jos\u00e9","family":"Novoa","sequence":"first","affiliation":[{"name":"Speech Processing and Transmission Laboratory, University of Chile, Santiago, Chile"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Rodrigo","family":"Mahu","sequence":"additional","affiliation":[{"name":"Speech Processing and Transmission Laboratory, University of Chile, Santiago, Chile"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jorge","family":"Wuth","sequence":"additional","affiliation":[{"name":"Speech Processing and Transmission Laboratory, University of Chile, Santiago, Chile"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Juan Pablo","family":"Escudero","sequence":"additional","affiliation":[{"name":"Speech Processing and Transmission Laboratory, University of Chile, Santiago, Chile"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Josu\u00e9","family":"Fredes","sequence":"additional","affiliation":[{"name":"Speech Processing and Transmission Laboratory, University of Chile, Santiago, 
Chile"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"N\u00e9stor Becerra","family":"Yoma","sequence":"additional","affiliation":[{"name":"Speech Processing and Transmission Laboratory, University of Chile, Santiago, Chile"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2021,3,8]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1561\/1100000005"},{"key":"e_1_2_1_2_1","volume-title":"Proceedings of the IEEE\/RSJ International Conference on Intelligent Robots and Systems. 528--534","author":"Luis","unstructured":"Luis S. Lopes and Antonio Teixeira. 2000. Human-robot interaction through spoken language dialogue . In Proceedings of the IEEE\/RSJ International Conference on Intelligent Robots and Systems. 528--534 . Luis S. Lopes and Antonio Teixeira. 2000. Human-robot interaction through spoken language dialogue. In Proceedings of the IEEE\/RSJ International Conference on Intelligent Robots and Systems. 528--534."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/HRI.2013.6483605"},{"key":"e_1_2_1_4_1","volume-title":"Proceedings of the 12th International Conference on Control, Automation and Systems. 1480--1485","author":"Lin Chao-Yu","year":"2012","unstructured":"Chao-Yu Lin , Kai-Tai Song , Yi-Wen Chen , Shuo-Cheng Chien , Sin-Horng Chen , Chen-Yu Chiang , Jyh-Her Yang , Yi-Chiao Wu and Tzu-Jui Liu . 2012 . User identification design by fusion of face recognition and speaker recognition . In Proceedings of the 12th International Conference on Control, Automation and Systems. 1480--1485 . Chao-Yu Lin, Kai-Tai Song, Yi-Wen Chen, Shuo-Cheng Chien, Sin-Horng Chen, Chen-Yu Chiang, Jyh-Her Yang, Yi-Chiao Wu and Tzu-Jui Liu. 2012. User identification design by fusion of face recognition and speaker recognition. In Proceedings of the 12th International Conference on Control, Automation and Systems. 
1480--1485."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSMCA.2012.2216870"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.5898\/JHRI.2.1.Kondo"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIM.2010.2047551"},{"key":"e_1_2_1_8_1","volume-title":"Proceedings of the International Conference on Applied Human Factors and Ergonomics. 183--194","author":"Meszaros Erica L.","unstructured":"Erica L. Meszaros , Meghan Chandarana , Anna Trujillo , and B. Danette Allen . 2017. Compensating for limitations in speech-based natural language processing with multimodal interfaces in UAV operation . In Proceedings of the International Conference on Applied Human Factors and Ergonomics. 183--194 . Erica L. Meszaros, Meghan Chandarana, Anna Trujillo, and B. Danette Allen. 2017. Compensating for limitations in speech-based natural language processing with multimodal interfaces in UAV operation. In Proceedings of the International Conference on Applied Human Factors and Ergonomics. 183--194."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCE.2010.5506027"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cognition.2011.05.005"},{"key":"e_1_2_1_11_1","volume-title":"DARPA Robotics Challenge. Major Qualifying Project","author":"Polido Henrique A.","unstructured":"Henrique A. Polido . 2014. DARPA Robotics Challenge. Major Qualifying Project . Worcester Polytechnic Institute (WPI) , Worcester, MA . Henrique A. Polido. 2014. DARPA Robotics Challenge. Major Qualifying Project. Worcester Polytechnic Institute (WPI), Worcester, MA."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/267658.267738"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/MGRS.2016.2540798"},{"key":"e_1_2_1_14_1","volume-title":"Digital Image Processing and Analysis: Human and Computer Vision Applications with CVIPtools (2nd. 
ed.)","author":"Umbaugh Scott E.","unstructured":"Scott E. Umbaugh . 2011. Digital Image Processing and Analysis: Human and Computer Vision Applications with CVIPtools (2nd. ed.) . CRC Press . Scott E. Umbaugh. 2011. Digital Image Processing and Analysis: Human and Computer Vision Applications with CVIPtools (2nd. ed.). CRC Press."},{"key":"e_1_2_1_15_1","volume-title":"Burge","author":"Burger Wilhelm","year":"2016","unstructured":"Wilhelm Burger and Mark J . Burge . 2016 . Digital Image Processing: An Algorithmic Introduction Using Java. Springer . Wilhelm Burger and Mark J. Burge. 2016. Digital Image Processing: An Algorithmic Introduction Using Java. Springer."},{"key":"e_1_2_1_16_1","volume-title":"Image Sensors and Signal Processing for Digital Still Cameras","author":"Nakamura Junichi","unstructured":"Junichi Nakamura . 2016. Image Sensors and Signal Processing for Digital Still Cameras . CRC Press . Junichi Nakamura. 2016. Image Sensors and Signal Processing for Digital Still Cameras. CRC Press."},{"key":"e_1_2_1_17_1","volume-title":"HMMs and related speech recognition technologies","author":"Young Steve","unstructured":"Steve Young . 2008. HMMs and related speech recognition technologies . In Springer Handbook of Speech Processing. Jacob Benesty, M. Mohan Sondhi, and Yiteng Huang (Eds.). Springer . 539--558. Steve Young. 2008. HMMs and related speech recognition technologies. In Springer Handbook of Speech Processing. Jacob Benesty, M. Mohan Sondhi, and Yiteng Huang (Eds.). Springer. 539--558."},{"key":"e_1_2_1_18_1","volume-title":"Jack","author":"Huang Xuedong D.","year":"1990","unstructured":"Xuedong D. Huang , Yasuo Ariki , and Mervyn A . Jack . 1990 . Hidden Markov Models for Speech Recognition. Edinburgh University Press . Xuedong D. Huang, Yasuo Ariki, and Mervyn A. Jack. 1990. Hidden Markov Models for Speech Recognition. 
Edinburgh University Press."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10044-014-0436-0"},{"key":"e_1_2_1_20_1","volume-title":"Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop. 275--280","author":"Chen Stanley F.","year":"1998","unstructured":"Stanley F. Chen , Douglas Beeferman , and Ronald Rosenfeld . 1998 . Evaluation metrics for language models . In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop. 275--280 . Stanley F. Chen, Douglas Beeferman, and Ronald Rosenfeld. 1998. Evaluation metrics for language models. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop. 275--280."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2018-1122"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASRU.2017.8268940"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2015-477"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/IJCNN.2002.1005585"},{"key":"e_1_2_1_25_1","first-page":"1","article-title":"Feature extraction methods LPC, PLP and MFCC in speech recognition","volume":"1","author":"Dave Namrata","year":"2013","unstructured":"Namrata Dave . 2013 . Feature extraction methods LPC, PLP and MFCC in speech recognition . Int. J. Adv. Res. Eng. Technol. 1 , 6 (2013), 1 -- 5 . Namrata Dave. 2013. Feature extraction methods LPC, PLP and MFCC in speech recognition. Int. J. Adv. Res. Eng. Technol. 1, 6 (2013), 1--5.","journal-title":"Int. J. Adv. Res. Eng. Technol."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASSP.1986.1164788"},{"key":"e_1_2_1_27_1","first-page":"3464","article-title":"Language-model\/acoustic channel balance mechanism","volume":"23","author":"Bahl Lalit R.","year":"1980","unstructured":"Lalit R. Bahl . 1980 . Language-model\/acoustic channel balance mechanism . IBM Techn. Disclos. Bull. 
23 , 7B (1980), 3464 -- 3465 . Lalit R. Bahl. 1980. Language-model\/acoustic channel balance mechanism. IBM Techn. Disclos. Bull. 23, 7B (1980), 3464--3465.","journal-title":"IBM Techn. Disclos. Bull."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/MSP.2012.2205597"},{"key":"e_1_2_1_29_1","volume-title":"Godfrey and Edward Holliman","author":"John","year":"1997","unstructured":"John J. Godfrey and Edward Holliman . 1997 . Switchboard-1 Release 2. LDC97S62. Linguistic Data Consortium , Philadelphia, PA. John J. Godfrey and Edward Holliman. 1997. Switchboard-1 Release 2. LDC97S62. Linguistic Data Consortium, Philadelphia, PA."},{"key":"e_1_2_1_30_1","volume-title":"Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events. 80--84","author":"Schr\u00f6der Jens","year":"2016","unstructured":"Jens Schr\u00f6der , J\u00f6rn Anem\u00fcller , and Stefan Goetze . 2016 . Performance comparison of GMM, HMM and DNN based approaches for acoustic event detection within Task 3 of the DCASE 2016 challenge . In Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events. 80--84 . Jens Schr\u00f6der, J\u00f6rn Anem\u00fcller, and Stefan Goetze. 2016. Performance comparison of GMM, HMM and DNN based approaches for acoustic event detection within Task 3 of the DCASE 2016 challenge. In Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events. 80--84."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2017-234"},{"key":"e_1_2_1_32_1","volume-title":"The IBM 2015 english conversational telephone speech recognition system. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH\u201915)","author":"Saon George","year":"2015","unstructured":"George Saon , Hong-Kwang J. Kuo , Steven Rennie , and Michael Picheny . 2015 . The IBM 2015 english conversational telephone speech recognition system. 
In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH\u201915) . 3140--3144. George Saon, Hong-Kwang J. Kuo, Steven Rennie, and Michael Picheny. 2015. The IBM 2015 english conversational telephone speech recognition system. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH\u201915). 3140--3144."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2017.7953159"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2014.2339736"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2013.6638947"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2016.7472809"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASRU.2015.7404793"},{"key":"e_1_2_1_39_1","volume-title":"Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH\u201916)","author":"Tara","unstructured":"Tara N. Sainath and Bo. Li. 2016. Modeling time-frequency patterns with LSTM vs. Convolutional Architectures for LVCSR Tasks . In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH\u201916) . 813--817. Tara N. Sainath and Bo. Li. 2016. Modeling time-frequency patterns with LSTM vs. Convolutional Architectures for LVCSR Tasks. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH\u201916). 813--817."},{"key":"e_1_2_1_40_1","volume-title":"Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH\u201916)","author":"Liu Yuzong","year":"2016","unstructured":"Yuzong Liu and Katrin Kirchhoff . 2016 . Novel front-end features based on neural graph embeddings for DNNHMM and LSTM-CTC acoustic modeling . 
In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH\u201916) . 793--797. Yuzong Liu and Katrin Kirchhoff. 2016. Novel front-end features based on neural graph embeddings for DNNHMM and LSTM-CTC acoustic modeling. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH\u201916). 793--797."},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2016-251"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2016.2602884"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2016-966"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2013.6639100"},{"key":"e_1_2_1_45_1","series-title":"Lecture Notes in Computer Science","volume-title":"Text, Speech, and Dialogue","author":"Malek Jiri","unstructured":"Jiri Malek and Jindrich Zdansky . 2019. On practical aspects of multi-condition training based on augmentation for reverberation-\/noise-robust speech recognition . In Text, Speech, and Dialogue , Lecture Notes in Computer Science , Vol. 11697 . Springer , Cham . Jiri Malek and Jindrich Zdansky. 2019. On practical aspects of multi-condition training based on augmentation for reverberation-\/noise-robust speech recognition. In Text, Speech, and Dialogue, Lecture Notes in Computer Science, Vol. 11697. Springer, Cham."},{"key":"e_1_2_1_46_1","first-page":"1","article-title":"Speech recognition in reverberant and noisy environments employing multiple feature extractors and i-vector speaker adaptation","volume":"1","author":"Alam Md Jahangir","year":"2015","unstructured":"Md Jahangir Alam , Vishwa Gupta , Patrick Kenny , and Pierre Dumouchel . 2015 . Speech recognition in reverberant and noisy environments employing multiple feature extractors and i-vector speaker adaptation . EURASIP J. Adv. Sign. Process. 1 , 50 (2015), 1 -- 13 . 
Md Jahangir Alam, Vishwa Gupta, Patrick Kenny, and Pierre Dumouchel. 2015. Speech recognition in reverberant and noisy environments employing multiple feature extractors and i-vector speaker adaptation. EURASIP J. Adv. Sign. Process. 1, 50 (2015), 1--13.","journal-title":"EURASIP J. Adv. Sign. Process."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2014.6854681"},{"key":"e_1_2_1_48_1","volume-title":"The HTK Book","author":"Young Steve","unstructured":"Steve Young , Gunnar Evermann , Mark Gales , Thomas Hain , Dan Kershaw , Xunying Liu , Gareth Moore , Julian Odell , Dave Ollason , Dan Povey , and others. 2006. The HTK Book . Cambridge University Engineering Department . Steve Young, Gunnar Evermann, Mark Gales, Thomas Hain, Dan Kershaw, Xunying Liu, Gareth Moore, Julian Odell, Dave Ollason, Dan Povey, and others. 2006. The HTK Book. Cambridge University Engineering Department."},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/29.45616"},{"key":"e_1_2_1_50_1","volume-title":"Sphinx-4: A flexible open source framework for speech recognition. Sun Microsystems","author":"Walker Willie","year":"2004","unstructured":"Willie Walker , Paul Lamere , Philip Kwok , Bhiksha Raj , Rita Singh , Evandro Gouvea , Peter Wolf , and Joe Woelfel . 2004. Sphinx-4: A flexible open source framework for speech recognition. Sun Microsystems , Inc., SMLI TR- 2004 --139. Sun Microsystems, Inc. Menlo Park, CA. Willie Walker, Paul Lamere, Philip Kwok, Bhiksha Raj, Rita Singh, Evandro Gouvea, Peter Wolf, and Joe Woelfel. 2004. Sphinx-4: A flexible open source framework for speech recognition. Sun Microsystems, Inc., SMLI TR-2004--139. Sun Microsystems, Inc. Menlo Park, CA."},{"key":"e_1_2_1_51_1","volume-title":"Proceeding of the Annual Conference of the International Speech Communication Association (INTERSPEECH\u201901)","author":"Lee Akinobu","year":"2001","unstructured":"Akinobu Lee , Tatsuya Kawahara , and Kiyohiro Shikano . 2001 . 
JULIUS - an open source real-time large vocabulary recognition engine . In Proceeding of the Annual Conference of the International Speech Communication Association (INTERSPEECH\u201901) . 1691--1694. Akinobu Lee, Tatsuya Kawahara, and Kiyohiro Shikano. 2001. JULIUS - an open source real-time large vocabulary recognition engine. In Proceeding of the Annual Conference of the International Speech Communication Association (INTERSPEECH\u201901). 1691--1694."},{"key":"e_1_2_1_52_1","volume-title":"Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding.","author":"Povey Daniel","year":"2011","unstructured":"Daniel Povey , Arnab Ghoshal , Gilles Boulianne , Lukas Burget , Ondrej Glembek , Nagendra Goel , Mirko Hannemann , Petr Motlicek , Yanmin Qian , Petr Schwarz , and others. 2011 . The kaldi speech recognition toolkit . In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding. Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, and others. 2011. The kaldi speech recognition toolkit. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding."},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/SLT.2012.6424249"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1007\/s12369-013-0217-8"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/2666242.2666248"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2013-587"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/TII.2016.2625818"},{"key":"e_1_2_1_58_1","volume-title":"Proceedings of the 28th National Conference on Artificial Intelligence. 2556--2563","author":"Matuszek Cynthia","year":"2014","unstructured":"Cynthia Matuszek , Liefeng Bo , Luke Zettlemoyer , and Dieter Fox . 2014 . 
Learning from unscripted deictic gesture and language for human-robot interactions . In Proceedings of the 28th National Conference on Artificial Intelligence. 2556--2563 . Cynthia Matuszek, Liefeng Bo, Luke Zettlemoyer, and Dieter Fox. 2014. Learning from unscripted deictic gesture and language for human-robot interactions. In Proceedings of the 28th National Conference on Artificial Intelligence. 2556--2563."},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/2909824.3020229"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1109\/COASE.2017.8256096"},{"key":"e_1_2_1_61_1","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence (AAAI\u201916)","author":"Fischer Michael","year":"2016","unstructured":"Michael Fischer , Samir Menon , and Oussama Khatib . 2016 . From bot to bot: Using a chat bot to synthesize robot motion . In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI\u201916) . Michael Fischer, Samir Menon, and Oussama Khatib. 2016. From bot to bot: Using a chat bot to synthesize robot motion. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI\u201916)."},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.15607\/RSS.2018.XIV.028"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1109\/SLT.2014.7078545"},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2015-54"},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1080\/01691864.2016.1164622"},{"key":"e_1_2_1_66_1","volume-title":"Proceedings of the Conference on Electronic Speech Signal Processing. 1--10","author":"Lange Patrick","year":"2014","unstructured":"Patrick Lange and David Suendermann-Oeft . 2014 . Tuning sphinx to outperform google's speech recognition API . In Proceedings of the Conference on Electronic Speech Signal Processing. 1--10 . Patrick Lange and David Suendermann-Oeft. 2014. Tuning sphinx to outperform google's speech recognition API. 
In Proceedings of the Conference on Electronic Speech Signal Processing. 1--10."},{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1109\/ROMAN.2014.6926324"},{"key":"e_1_2_1_68_1","unstructured":"Matthew Marge Claire Bonial Brendan Byrne Taylor Cassidy A. William Evans Susan G. Hill and Clare Voss. 2017. Applying the wizard-of-oz technique to multimodal human-robot dialogue. arXiv:1703.03714. Retrieved from https:\/\/arxiv.org\/abs\/1703.03714.  Matthew Marge Claire Bonial Brendan Byrne Taylor Cassidy A. William Evans Susan G. Hill and Clare Voss. 2017. Applying the wizard-of-oz technique to multimodal human-robot dialogue. arXiv:1703.03714. Retrieved from https:\/\/arxiv.org\/abs\/1703.03714."},{"key":"e_1_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.1109\/HRI.2016.7451752"},{"key":"e_1_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.1109\/HRI.2016.7451794"},{"key":"e_1_2_1_71_1","volume-title":"Proceedings of the AAAI Spring Symposium Series.","author":"Hoffman Guy","year":"2016","unstructured":"Guy Hoffman . 2016 . OpenWoZ: A runtime-configurable wizard-of-oz framework for human-robot interaction . In Proceedings of the AAAI Spring Symposium Series. Guy Hoffman. 2016. OpenWoZ: A runtime-configurable wizard-of-oz framework for human-robot interaction. In Proceedings of the AAAI Spring Symposium Series."},{"key":"e_1_2_1_72_1","volume-title":"Proceedings of the AAAI Spring Symposium Series.","author":"Martelaro Nikolas","year":"2016","unstructured":"Nikolas Martelaro . 2016 . Wizard-of-oz interfaces as a step towards autonomous HRI . In Proceedings of the AAAI Spring Symposium Series. Nikolas Martelaro. 2016. Wizard-of-oz interfaces as a step towards autonomous HRI. 
In Proceedings of the AAAI Spring Symposium Series."},{"key":"e_1_2_1_73_1","doi-asserted-by":"publisher","DOI":"10.1109\/HRI.2016.7451823"},{"key":"e_1_2_1_74_1","doi-asserted-by":"publisher","DOI":"10.1109\/HRI.2016.7451832"},{"key":"e_1_2_1_75_1","doi-asserted-by":"publisher","DOI":"10.1109\/HRI.2016.7451888"},{"key":"e_1_2_1_76_1","doi-asserted-by":"publisher","DOI":"10.1109\/HSCMA.2017.7895560"},{"key":"e_1_2_1_77_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2015.7177990"},{"key":"e_1_2_1_78_1","volume-title":"Speech recognition with microphone arrays","author":"Omologo Maurizio","unstructured":"Maurizio Omologo , Marco Matassoni , and Piergiorgio Svaizer . 2001. Speech recognition with microphone arrays . In Microphone Arrays, Signal Processing Techniques and Applications. Michael Brandstein and Darren Ward (Eds.). Springer-Verlag , 331--353. Maurizio Omologo, Marco Matassoni, and Piergiorgio Svaizer. 2001. Speech recognition with microphone arrays. In Microphone Arrays, Signal Processing Techniques and Applications. Michael Brandstein and Darren Ward (Eds.). Springer-Verlag, 331--353."},{"key":"e_1_2_1_79_1","volume-title":"Superdirective microphone arrays","author":"Bitzer Joerg","unstructured":"Joerg Bitzer and K. Uwe Simmer . 2001. Superdirective microphone arrays . In Microphone Arrays, Signal Processing Techniques and Applications. Michael Brandstein and Darren Ward (Eds.). Springer-Verlag , 19--38. Joerg Bitzer and K. Uwe Simmer. 2001. Superdirective microphone arrays. In Microphone Arrays, Signal Processing Techniques and Applications. Michael Brandstein and Darren Ward (Eds.). Springer-Verlag, 19--38."},{"key":"e_1_2_1_80_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.engappai.2015.04.015"},{"key":"e_1_2_1_81_1","volume-title":"Post-filtering techniques","author":"Simmer K. Uwe","unstructured":"K. Uwe Simmer , Joerg Bitzer , and Claude Marro . 2001. Post-filtering techniques . 
In Microphone Arrays, Signal Processing Techniques and Applications. Michael Brandstein and Darren Ward (Eds.). Springer-Verlag , 39--60. K. Uwe Simmer, Joerg Bitzer, and Claude Marro. 2001. Post-filtering techniques. In Microphone Arrays, Signal Processing Techniques and Applications. Michael Brandstein and Darren Ward (Eds.). Springer-Verlag, 39--60."},{"key":"e_1_2_1_82_1","doi-asserted-by":"publisher","DOI":"10.1109\/LSP.2016.2642178"},{"key":"e_1_2_1_83_1","volume-title":"Proceedings of the ACM\/IEEE International Conference on Human-Robot Interaction. 150--159","author":"Novoa Jos\u00e9","year":"2018","unstructured":"Jos\u00e9 Novoa , Jorge Wuth , Juan Pablo Escudero , Josu\u00e9 Fredes , Rodrigo Mahu , and N\u00e9stor Becerra Yoma . 2018 . DNNHMM based automatic speech recognition for HRI scenarios . In Proceedings of the ACM\/IEEE International Conference on Human-Robot Interaction. 150--159 . Jos\u00e9 Novoa, Jorge Wuth, Juan Pablo Escudero, Josu\u00e9 Fredes, Rodrigo Mahu, and N\u00e9stor Becerra Yoma. 2018. DNNHMM based automatic speech recognition for HRI scenarios. In Proceedings of the ACM\/IEEE International Conference on Human-Robot Interaction. 150--159."},{"key":"e_1_2_1_84_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.compeleceng.2015.12.010"},{"key":"e_1_2_1_85_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICARA.2015.7081122"},{"key":"e_1_2_1_86_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-11900-7_40"},{"key":"e_1_2_1_87_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICAR.2015.7251484"},{"key":"e_1_2_1_88_1","doi-asserted-by":"publisher","DOI":"10.1109\/WCICA.2016.7578518"},{"key":"e_1_2_1_89_1","volume-title":"The fifth \u2018CHiME","author":"Barker Jon","year":"2018","unstructured":"Jon Barker , Shinji Watanabe , Emmanuel Vincent , and Jan Trmal . 2018. The fifth \u2018CHiME \u2019 speech separation and recognition challenge: Dataset, task and baselines. arXiv: 1803 .10609. 
Retrieved from https:\/\/arxiv.org\/abs\/1803.10609. Jon Barker, Shinji Watanabe, Emmanuel Vincent, and Jan Trmal. 2018. The fifth \u2018CHiME\u2019 speech separation and recognition challenge: Dataset, task and baselines. arXiv:1803.10609. Retrieved from https:\/\/arxiv.org\/abs\/1803.10609."},{"key":"e_1_2_1_90_1","doi-asserted-by":"publisher","DOI":"10.1145\/1121241.1121272"},{"key":"e_1_2_1_91_1","volume-title":"Josu\u00e9 Fredes, Jorge Wuth, Rodrigo Mahu, and N\u00e9stor Becerra Yoma.","author":"Novoa Jos\u00e9","year":"2017","unstructured":"Jos\u00e9 Novoa , Juan Pablo Escudero , Josu\u00e9 Fredes, Jorge Wuth, Rodrigo Mahu, and N\u00e9stor Becerra Yoma. 2017 . Multichannel robot speech recognition database: MChRSR. arXiv:1801.00061. https:\/\/arxiv.org\/abs\/1801.00061. Jos\u00e9 Novoa, Juan Pablo Escudero, Josu\u00e9 Fredes, Jorge Wuth, Rodrigo Mahu, and N\u00e9stor Becerra Yoma. 2017. Multichannel robot speech recognition database: MChRSR. arXiv:1801.00061. https:\/\/arxiv.org\/abs\/1801.00061."},{"key":"e_1_2_1_92_1","volume-title":"Proceedings of the Audio Engineering Society Convention 108","author":"Farina Angelo","year":"2000","unstructured":"Angelo Farina . 2000 . Simultaneous measurement of impulse response and distortion with a swept-sine technique . In Proceedings of the Audio Engineering Society Convention 108 . 1--23. Angelo Farina. 2000. Simultaneous measurement of impulse response and distortion with a swept-sine technique. In Proceedings of the Audio Engineering Society Convention 108. 
1--23."},{"key":"e_1_2_1_93_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.csl.2017.02.003"},{"key":"e_1_2_1_94_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.csl.2017.02.001"},{"key":"e_1_2_1_95_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2013-548"},{"key":"e_1_2_1_96_1","volume-title":"Proceedings of the Symposium of the Pattern Recognition Association of South Africa (PRASA\u201909)","author":"Kamper Herman","year":"2009","unstructured":"Herman Kamper and Thomas Niesler. 2009. Characterisation and simulation of telephone channels using the TIMIT and NTIMIT databases. In Proceedings of the Symposium of the Pattern Recognition Association of South Africa (PRASA\u201909), 47--52."},{"key":"e_1_2_1_97_1","volume-title":"Proceedings of the International Conference on Acoustics, Speech, and Signal Processing. 97--100","author":"Pallett David S.","unstructured":"David S. Pallett, William M. Fisher, and Jonathan G. Fiscus. 1990. Tools for the analysis of benchmark speech recognition tests. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing. 97--100."},{"key":"e_1_2_1_98_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.1995.479274"},{"key":"e_1_2_1_99_1","volume-title":"Version 2.0, AU\/417\/02","author":"Hirsch Guenter","unstructured":"Guenter Hirsch. 2002. Experimental framework for the performance evaluation of speech recognition front-ends on a large vocabulary task, Version 2.0, AU\/417\/02. 
ETSI STQ Aurora DSR Working Group."},{"key":"e_1_2_1_100_1","volume-title":"Proceedings of the Workshop on Speech and Natural Language, Harriman, 357--362","author":"Douglas","unstructured":"Douglas B. Paul and Janet M. Baker. 1992. The design for the wall street journal-based CSR corpus. In Proceedings of the Workshop on Speech and Natural Language, Harriman, 357--362."},{"key":"e_1_2_1_101_1","volume-title":"Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP\u201994)","author":"Woodland Philip C.","unstructured":"Philip C. Woodland, Julian J. Odell, Valtcho Valtchev, and Steve J. Young. 1994. Large vocabulary continuous speech recognition using HTK. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP\u201994), vol. II, II\/125-II\/128."},{"key":"e_1_2_1_102_1","unstructured":"Guenter Hirsch. 2005. FaNT Filtering and Noise Adding Tool. Software. Retrieved from http:\/\/dnt.kr.hs-niederrhein.de\/."},{"key":"e_1_2_1_103_1","unstructured":"Microsoft. 2013. 
Kinect for Windows Software Development kit v1.8. Retrieved from https:\/\/www.microsoft.com\/en-us\/download\/details.aspx?id=40278."},{"key":"e_1_2_1_104_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASL.2007.902460"},{"key":"e_1_2_1_105_1","unstructured":"Anthony Zhang. 2017. Speech Recognition (v3.7). Software. Retrieved from https:\/\/github.com\/Uberi\/speech_recognition."},{"key":"e_1_2_1_106_1","volume-title":"Retrieved","author":"Synnaeve Gabriel","year":"2020","unstructured":"Gabriel Synnaeve. 2020. WER Are We?. Retrieved July 21, 2020 from https:\/\/github.com\/syhw\/wer_are_we."},{"key":"e_1_2_1_107_1","doi-asserted-by":"publisher","DOI":"10.1109\/LSP.2017.2661699"},{"key":"e_1_2_1_108_1","doi-asserted-by":"publisher","DOI":"10.1109\/COASE.2017.8256096"},{"key":"e_1_2_1_109_1","volume-title":"Proceedings of the 2016 AAAI Fall Symposium Series.","author":"Fischer Michael","year":"2016","unstructured":"Michael Fischer, Samir Menon, and Oussama Khatib. 2016. From bot to bot: Using a chat bot to synthesize robot motion. 
In Proceedings of the 2016 AAAI Fall Symposium Series."}],"container-title":["ACM Transactions on Human-Robot Interaction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3442629","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3442629","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3442629","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T22:03:03Z","timestamp":1750197783000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3442629"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,3,8]]},"references-count":109,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2021,6,30]]}},"alternative-id":["10.1145\/3442629"],"URL":"https:\/\/doi.org\/10.1145\/3442629","relation":{},"ISSN":["2573-9522"],"issn-type":[{"value":"2573-9522","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,3,8]]},"assertion":[{"value":"2018-12-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-11-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-03-08","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}