{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,30]],"date-time":"2025-12-30T23:47:11Z","timestamp":1767138431087,"version":"build-2238731810"},"publisher-location":"Cham","reference-count":32,"publisher":"Springer Nature Switzerland","isbn-type":[{"value":"9783031159305","type":"print"},{"value":"9783031159312","type":"electronic"}],"license":[{"start":{"date-parts":[[2022,1,1]],"date-time":"2022-01-01T00:00:00Z","timestamp":1640995200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/www.springernature.com\/gp\/researchers\/text-and-data-mining"},{"start":{"date-parts":[[2022,1,1]],"date-time":"2022-01-01T00:00:00Z","timestamp":1640995200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.springernature.com\/gp\/researchers\/text-and-data-mining"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2022]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Conducting a dialog in human-robot interaction (HRI) involves complexities that are hard to reconcile by individual research or engineering works. Towards the development of a robotic dialog agent, we develop a verbal and visual instruction scenario in which a robot needs to enter into a dialog to resolve ambiguities. We propose a novel hybrid neural architecture to learn the robotic part of the interaction. A neural dialog state tracker learns to process the user input depending on visual inputs and dialog instances. It uses variables to allow certain generality to generate the robot\u2019s physical or verbal actions. We train it on a new visual dialog dataset, test different forms of input representations, and validate the robot agent on unseen examples. We evaluate our hybrid neural network approach in handling an HRI conversation scenario that is extendable to a real robot. 
Furthermore, we demonstrate that the hybrid approach allows generalization to a large range of unseen visual inputs and verbal instructions.<\/jats:p>","DOI":"10.1007\/978-3-031-15931-2_22","type":"book-chapter","created":{"date-parts":[[2022,9,6]],"date-time":"2022-09-06T01:03:47Z","timestamp":1662426227000},"page":"258-269","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["Learning Visually Grounded Human-Robot Dialog in\u00a0a\u00a0Hybrid Neural Architecture"],"prefix":"10.1007","author":[{"given":"Xiaowen","family":"Sun","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Cornelius","family":"Weber","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Matthias","family":"Kerzel","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tom","family":"Weber","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Mengdi","family":"Li","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Stefan","family":"Wermter","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2022,9,7]]},"reference":[{"key":"22_CR1","unstructured":"Alomari, M., Dukes, K.: Extended train robots (2016). https:\/\/doi.org\/10.5518\/32"},{"key":"22_CR2","doi-asserted-by":"crossref","unstructured":"Bagaskara, A., Naufal, A.R., Dhojopatmo, I.E., Abdurrab, A., Budiharto, W.: Development of smart restaurant application for dine-in. In: Conference on Computer Science and Artificial Intelligence, vol. 1, pp. 230\u2013235 (2021)","DOI":"10.1109\/ICCSAI53272.2021.9609723"},{"key":"22_CR3","unstructured":"Bordes, A., Boureau, Y.L., Weston, J.: Learning end-to-end goal-oriented dialog. 
Preprint arXiv:1605.07683 (2016)"},{"key":"22_CR4","doi-asserted-by":"crossref","unstructured":"Brabra, H., B\u00e1ez, M., Benatallah, B., Gaaloul, W., Bouguelia, S., Zamanirad, S.: Dialogue management in conversational systems: a review of approaches, challenges, and opportunities. IEEE Trans. Cogn. Dev. Syst. (2021)","DOI":"10.1109\/TCDS.2021.3086565"},{"key":"22_CR5","doi-asserted-by":"crossref","unstructured":"Calli, B., Walsman, A., Singh, A., Srinivasa, S., Abbeel, P., Dollar, A.M.: Benchmarking in manipulation research: the YCB object and model set and benchmarking protocols. Preprint arXiv:1502.03143 (2015)","DOI":"10.1109\/MRA.2015.2448951"},{"key":"22_CR6","doi-asserted-by":"crossref","unstructured":"Das, A., et al.: Visual dialog. In: IEEE Conference on Computer Vision and Pattern Recognition (2017)","DOI":"10.1109\/CVPR.2017.121"},{"key":"22_CR7","doi-asserted-by":"crossref","unstructured":"De Vries, H., Strub, F., Chandar, S., Pietquin, O., Larochelle, H., Courville, A.: GuessWhat?! Visual object discovery through multi-modal dialogue. In: IEEE Conference on Computer Vision and Pattern Recognition (2017)","DOI":"10.1109\/CVPR.2017.475"},{"key":"22_CR8","unstructured":"Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 (2018)"},{"key":"22_CR9","doi-asserted-by":"crossref","unstructured":"Goel, R., Paul, S., Hakkani-T\u00fcr, D.: HyST: a hybrid approach for flexible and accurate dialogue state tracking. Preprint arXiv:1907.00883 (2019)","DOI":"10.21437\/Interspeech.2019-1863"},{"key":"22_CR10","doi-asserted-by":"crossref","unstructured":"Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: elevating the role of image understanding in visual question answering. 
In: IEEE Conference on Computer Vision and Pattern Recognition (2017)","DOI":"10.1109\/CVPR.2017.670"},{"key":"22_CR11","doi-asserted-by":"crossref","unstructured":"Henderson, M., Thomson, B., Young, S.: Word-based dialog state tracking with recurrent neural networks. In: 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 292\u2013299 (2014)","DOI":"10.3115\/v1\/W14-4340"},{"issue":"8","key":"22_CR12","doi-asserted-by":"publisher","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","volume":"9","author":"S Hochreiter","year":"1997","unstructured":"Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735\u20131780 (1997)","journal-title":"Neural Comput."},{"key":"22_CR13","unstructured":"Huang, X., et al.: Joint generation and bi-encoder for situated interactive multimodal conversations. In: AAAI 2021 DSTC9 Workshop (2021)"},{"key":"22_CR14","unstructured":"Jeong, Y., Lee, S.J., Ko, Y., Seo, J.: TOM: end-to-end task-oriented multimodal dialog system with GPT-2. In: AAAI 2021 DSTC9 Workshop (2021)"},{"key":"22_CR15","doi-asserted-by":"crossref","unstructured":"Kerzel, M., Abawi, F., Eppe, M., Wermter, S.: Enhancing a neurocognitive shared visuomotor model for object identification, localization, and grasping with learning from auxiliary tasks. IEEE Trans. Cogn. Dev. Syst. 1\u201313 (2020)","DOI":"10.1109\/DEVLRN.2019.8850679"},{"key":"22_CR16","doi-asserted-by":"crossref","unstructured":"Kerzel, M., Strahl, E., Magg, S., Navarro-Guerrero, N., Heinrich, S., Wermter, S.: NICO-neuro-inspired COmpanion: a developmental humanoid robot platform for multimodal interaction. In: IEEE International Symposium on Robot and Human Interactive Communication, pp. 113\u2013120 (2017)","DOI":"10.1109\/ROMAN.2017.8172289"},{"key":"22_CR17","doi-asserted-by":"crossref","unstructured":"Kottur, S., Moon, S., Geramifard, A., Damavandi, B.: SIMMC 2.0: a task-oriented dialog dataset for immersive multimodal conversations. 
arXiv:2104.08667 (2021)","DOI":"10.18653\/v1\/2021.emnlp-main.401"},{"key":"22_CR18","unstructured":"Kottur, S., Moura, J.M., Parikh, D., Batra, D., Rohrbach, M.: CLEVR-dialog: a diagnostic dataset for multi-round reasoning in visual dialog. Preprint arXiv:1903.03166 (2019)"},{"key":"22_CR19","doi-asserted-by":"crossref","unstructured":"Lewis, M., et al.: BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. Preprint arXiv:1910.13461 (2019)","DOI":"10.18653\/v1\/2020.acl-main.703"},{"key":"22_CR20","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Goyal, P., Girshick, R.B., He, K., Doll\u00e1r, P.: Focal loss for dense object detection. IEEE International Conference on Computer Vision (2017)","DOI":"10.1109\/ICCV.2017.324"},{"key":"22_CR21","series-title":"Lecture Notes in Computer Science","doi-asserted-by":"publisher","first-page":"740","DOI":"10.1007\/978-3-319-10602-1_48","volume-title":"Computer Vision \u2013 ECCV 2014","author":"T-Y Lin","year":"2014","unstructured":"Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740\u2013755. Springer, Cham (2014). https:\/\/doi.org\/10.1007\/978-3-319-10602-1_48"},{"key":"22_CR22","unstructured":"Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Advances in Neural Information Processing Systems, vol. 32 (2019)"},{"key":"22_CR23","unstructured":"Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space (2013)"},{"key":"22_CR24","doi-asserted-by":"crossref","unstructured":"Moon, S., et al.: Situated and interactive multimodal conversations. 
Preprint arXiv:2006.01460 (2020)","DOI":"10.18653\/v1\/2020.coling-main.96"},{"key":"22_CR25","unstructured":"Mostafazadeh, N., et al.: Image-grounded conversations: multimodal context for natural question and response generation. Preprint arXiv:1701.08251 (2017)"},{"key":"22_CR26","series-title":"Lecture Notes in Computer Science","doi-asserted-by":"publisher","first-page":"336","DOI":"10.1007\/978-3-030-58523-5_20","volume-title":"Computer Vision \u2013 ECCV 2020","author":"V Murahari","year":"2020","unstructured":"Murahari, V., Batra, D., Parikh, D., Das, A.: Large-scale pretraining for visual dialog: a simple state-of-the-art baseline. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12363, pp. 336\u2013352. Springer, Cham (2020). https:\/\/doi.org\/10.1007\/978-3-030-58523-5_20"},{"key":"22_CR27","doi-asserted-by":"crossref","unstructured":"Ni, J., Young, T., Pandelea, V., Xue, F., Adiga, V., Cambria, E.: Recent advances in deep learning based dialogue systems: a systematic survey. Preprint arXiv:2105.04387 (2021)","DOI":"10.1007\/s10462-022-10248-8"},{"key":"22_CR28","doi-asserted-by":"crossref","unstructured":"Qian, K., et al.: Database search results disambiguation for task-oriented dialog systems. Preprint arXiv:2112.08351 (2021)","DOI":"10.18653\/v1\/2022.naacl-main.85"},{"key":"22_CR29","unstructured":"Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training (2018)"},{"key":"22_CR30","unstructured":"Weston, J., et al.: Towards AI-complete question answering: a set of prerequisite toy tasks. Preprint arXiv:1502.05698 (2015)"},{"key":"22_CR31","doi-asserted-by":"crossref","unstructured":"Williams, J.D., Asadi, K., Zweig, G.: Hybrid code networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning. 
Preprint arXiv:1702.03274 (2017)","DOI":"10.18653\/v1\/P17-1062"},{"key":"22_CR32","unstructured":"Zeiler, M.D.: ADADELTA: an adaptive learning rate method. Preprint arXiv:1212.5701 (2012)"}],"updated-by":[{"DOI":"10.1007\/978-3-031-15931-2_67","type":"correction","label":"Correction","source":"publisher","updated":{"date-parts":[[2023,4,5]],"date-time":"2023-04-05T00:00:00Z","timestamp":1680652800000}}],"container-title":["Lecture Notes in Computer Science","Artificial Neural Networks and Machine Learning \u2013 ICANN 2022"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/978-3-031-15931-2_22","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,4,4]],"date-time":"2023-04-04T14:18:44Z","timestamp":1680617924000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/978-3-031-15931-2_22"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022]]},"ISBN":["9783031159305","9783031159312"],"references-count":32,"URL":"https:\/\/doi.org\/10.1007\/978-3-031-15931-2_22","relation":{},"ISSN":["0302-9743","1611-3349"],"issn-type":[{"value":"0302-9743","type":"print"},{"value":"1611-3349","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022]]},"assertion":[{"value":"7 September 2022","order":1,"name":"first_online","label":"First Online","group":{"name":"ChapterHistory","label":"Chapter History"}},{"value":"5 April 2023","order":2,"name":"change_date","label":"Change Date","group":{"name":"ChapterHistory","label":"Chapter History"}},{"value":"Correction","order":3,"name":"change_type","label":"Change Type","group":{"name":"ChapterHistory","label":"Chapter History"}},{"value":"A correction has been published.","order":4,"name":"change_details","label":"Change Details","group":{"name":"ChapterHistory","label":"Chapter 
History"}},{"value":"ICANN","order":1,"name":"conference_acronym","label":"Conference Acronym","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"International Conference on Artificial Neural Networks","order":2,"name":"conference_name","label":"Conference Name","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Bristol","order":3,"name":"conference_city","label":"Conference City","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"United Kingdom","order":4,"name":"conference_country","label":"Conference Country","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"2022","order":5,"name":"conference_year","label":"Conference Year","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"6 September 2022","order":7,"name":"conference_start_date","label":"Conference Start Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"9 September 2022","order":8,"name":"conference_end_date","label":"Conference End Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"31","order":9,"name":"conference_number","label":"Conference Number","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"icann2022","order":10,"name":"conference_id","label":"Conference ID","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"https:\/\/e-nns.org\/icann2022\/","order":11,"name":"conference_url","label":"Conference URL","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Single-blind","order":1,"name":"type","label":"Type","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"EasyChair","order":2,"name":"conference_management_system","label":"Conference Management System","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided 
by the conference organizers)"}},{"value":"561","order":3,"name":"number_of_submissions_sent_for_review","label":"Number of Submissions Sent for Review","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"255","order":4,"name":"number_of_full_papers_accepted","label":"Number of Full Papers Accepted","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"4","order":5,"name":"number_of_short_papers_accepted","label":"Number of Short Papers Accepted","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"45% - The value is computed by the equation \"Number of Full Papers Accepted \/ Number of Submissions Sent for Review * 100\" and then rounded to a whole number.","order":6,"name":"acceptance_rate_of_full_papers","label":"Acceptance Rate of Full Papers","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"3","order":7,"name":"average_number_of_reviews_per_paper","label":"Average Number of Reviews per Paper","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"No","order":9,"name":"external_reviewers_involved","label":"External Reviewers Involved","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}}]}}