{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,13]],"date-time":"2026-04-13T18:52:43Z","timestamp":1776106363916,"version":"3.50.1"},"publisher-location":"Cham","reference-count":34,"publisher":"Springer Nature Switzerland","isbn-type":[{"value":"9783031723407","type":"print"},{"value":"9783031723414","type":"electronic"}],"license":[{"start":{"date-parts":[[2024,1,1]],"date-time":"2024-01-01T00:00:00Z","timestamp":1704067200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,9,17]],"date-time":"2024-09-17T00:00:00Z","timestamp":1726531200000},"content-version":"vor","delay-in-days":260,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2024]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>We investigate the use of Large Language Models (LLMs) to equip neural robotic agents with human-like social and cognitive competencies, for the purpose of open-ended human-robot conversation and collaboration. We introduce a modular and extensible methodology for grounding an LLM with the sensory perceptions and capabilities of a physical robot, and integrate multiple deep learning models throughout the architecture in a form of system integration. The integrated models encompass various functions such as speech recognition, speech generation, open-vocabulary object detection, human pose estimation, and gesture detection, with the LLM serving as the central text-based coordinating unit. The qualitative and quantitative results demonstrate the huge potential of LLMs in providing emergent cognition and interactive language-oriented control of robots in a natural and social manner. <jats:bold>Video:<\/jats:bold><jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/youtu.be\/A2WLEuiM3-s\">https:\/\/youtu.be\/A2WLEuiM3-s<\/jats:ext-link>.<\/jats:p>","DOI":"10.1007\/978-3-031-72341-4_21","type":"book-chapter","created":{"date-parts":[[2024,9,16]],"date-time":"2024-09-16T13:02:55Z","timestamp":1726491775000},"page":"306-321","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":23,"title":["When Robots Get Chatty: Grounding Multimodal Human-Robot Conversation and\u00a0Collaboration"],"prefix":"10.1007","author":[{"given":"Philipp","family":"Allgeuer","sequence":"first","affiliation":[]},{"given":"Hassan","family":"Ali","sequence":"additional","affiliation":[]},{"given":"Stefan","family":"Wermter","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,9,17]]},"reference":[{"key":"21_CR1","unstructured":"Akata, E., Schulz, L., Coda-Forno, J., et\u00a0al.: Playing repeated games with large language models. arXiv preprint arXiv:2305.16867 (2023)"},{"key":"21_CR2","unstructured":"Allgeuer, P.: Improved ViLD fork. https:\/\/github.com\/pallgeuer\/tpu\/tree\/master\/models\/official\/detection\/projects\/vild"},{"key":"21_CR3","unstructured":"Brohan, A., Chebotar, Y., Finn, C., et\u00a0al.: Do as I can, not as I say: grounding language in robotic affordances. In: Conference on Robot Learning. PMLR (2023)"},{"key":"21_CR4","doi-asserted-by":"crossref","unstructured":"Casiez, G., Roussel, N., Vogel, D.: 1\u20ac filter: a simple speed-based low-pass filter for noisy input in interactive systems. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 2527\u20132530 (2012)","DOI":"10.1145\/2207676.2208639"},{"key":"21_CR5","doi-asserted-by":"crossref","unstructured":"Cross, E.S., Hortensius, R., Wykowska, A.: From social brains to social robots: Applying neurocognitive insights to human-robot interaction. Philosophical Transactions of the Royal Society B 374(1771) (2019)","DOI":"10.1098\/rstb.2018.0024"},{"key":"21_CR6","unstructured":"Ge, Z., Liu, S., Wang, F., Li, Z., Sun, J.: YOLOX: exceeding YOLO series in 2021. arXiv preprint arXiv:2107.08430 (2021)"},{"key":"21_CR7","unstructured":"Gu, X., Lin, T.Y., Kuo, W., Cui, Y.: Open-vocabulary object detection via vision and language knowledge distillation. In: International Conference on Learning Representations (2022)"},{"key":"21_CR8","doi-asserted-by":"crossref","unstructured":"Gupta, A., Dollar, P., Girshick, R.: LVIS: a dataset for large vocabulary instance segmentation. In: Computer Vision and Pattern Recognition (2019)","DOI":"10.1109\/CVPR.2019.00550"},{"issue":"1","key":"21_CR9","doi-asserted-by":"publisher","first-page":"9","DOI":"10.1007\/s43154-020-00035-0","volume":"2","author":"A Henschel","year":"2021","unstructured":"Henschel, A., Laban, G., Cross, E.S.: What makes a robot social? A review of social robots from science fiction to a home or hospital near you. Curr. Rob. Rep. 2(1), 9\u201319 (2021). https:\/\/doi.org\/10.1007\/s43154-020-00035-0","journal-title":"Curr. Rob. Rep."},{"key":"21_CR10","doi-asserted-by":"crossref","unstructured":"Huang, C., Mees, O., Zeng, A., Burgard, W.: Visual language maps for robot navigation. In: Proceedings of the International Conference on Robotics and Automation (2023)","DOI":"10.1109\/ICRA48891.2023.10160969"},{"key":"21_CR11","unstructured":"Jiang, A.Q., et\u00a0al.: Mistral 7B. arXiv preprint arXiv:2310.06825 (2023)"},{"key":"21_CR12","doi-asserted-by":"publisher","first-page":"355","DOI":"10.1007\/978-3-031-19842-7_21","volume-title":"Computer Vision \u2013 ECCV 2022","author":"Y Kant","year":"2022","unstructured":"Kant, Y., et al.: Housekeep: tidying virtual households using commonsense reasoning. In: Avidan, S., Brostow, G., Ciss\u00e9, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision \u2013 ECCV 2022, pp. 355\u2013373. Springer, Cham (2022). https:\/\/doi.org\/10.1007\/978-3-031-19842-7_21"},{"key":"21_CR13","doi-asserted-by":"crossref","unstructured":"Kerzel, M., et al.: NICOL: a neuro-inspired collaborative semi-humanoid robot that bridges social interaction and reliable manipulation. IEEE Access 11, 123531\u2013123542 (2023)","DOI":"10.1109\/ACCESS.2023.3329370"},{"key":"21_CR14","unstructured":"Kim, J., Kong, J., Son, J.: Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In: International Conference on Machine Learning, pp. 5530\u20135540. PMLR (2021)"},{"key":"21_CR15","doi-asserted-by":"crossref","unstructured":"Kwon, M., Hu, H., Myers, V., Karamcheti, S., Dragan, A., Sadigh, D.: Toward grounded social reasoning. arXiv preprint arXiv:2306.08651 (2023)","DOI":"10.1109\/ICRA57147.2024.10611218"},{"key":"21_CR16","unstructured":"Radford, A., et\u00a0al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. PMLR (2021)"},{"key":"21_CR17","unstructured":"Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. In: International Conference on Machine Learning, pp. 28492\u201328518. PMLR (2023)"},{"key":"21_CR18","unstructured":"ROBOTIS: OpenManipulator-P: multi-purpose affordable manipulator for research and education (2023). https:\/\/www.robotis.us\/openmanipulator-p"},{"key":"21_CR19","unstructured":"Seed Robotics: RH8D adult-size dexterous robot hand (2023). https:\/\/www.seedrobotics.com\/rh8d-adult-robot-hand"},{"key":"21_CR20","doi-asserted-by":"crossref","unstructured":"Singh, I., Blukis, V., et al.: ProgPrompt: generating situated robot task plans using large language models. In: International Conference on Robotics and Automation (2023)","DOI":"10.1109\/ICRA48891.2023.10161317"},{"key":"21_CR21","doi-asserted-by":"publisher","unstructured":"Starke, S., Hendrich, N., Zhang, J.: A memetic evolutionary algorithm for real-time articulated kinematic motion. In: 2017 IEEE Congress on Evolutionary Computation (CEC), pp. 2473\u20132479 (2017). https:\/\/doi.org\/10.1109\/CEC.2017.7969605","DOI":"10.1109\/CEC.2017.7969605"},{"key":"21_CR22","doi-asserted-by":"crossref","unstructured":"Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: Proceedings of Conference on Computer Vision and Pattern Recognition, pp. 5693\u20135703 (2019)","DOI":"10.1109\/CVPR.2019.00584"},{"key":"21_CR23","unstructured":"Touvron, H., et\u00a0al.: LLaMa: open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)"},{"key":"21_CR24","doi-asserted-by":"crossref","unstructured":"Vemprala, S., Bonatti, R., Bucker, A., Kapoor, A.: ChatGPT for robotics: design principles and model abilities. Microsoft Autonomous Systems and Robotics Research Group (2023)","DOI":"10.1109\/ACCESS.2024.3387941"},{"key":"21_CR25","doi-asserted-by":"publisher","first-page":"95060","DOI":"10.1109\/ACCESS.2023.3310935","volume":"11","author":"N Wake","year":"2023","unstructured":"Wake, N., Kanehira, A., Sasabuchi, K., Takamatsu, J., Ikeuchi, K.: ChatGPT empowered long-step robot control in various environments: a case application. IEEE Access 11, 95060\u201395078 (2023). https:\/\/doi.org\/10.1109\/ACCESS.2023.3310935","journal-title":"IEEE Access"},{"key":"21_CR26","doi-asserted-by":"crossref","unstructured":"Wang, C., Hasler, S., Tanneberg, D., Ocker, F., Joublin, F., et\u00a0al.: LaMI: large language models for multi-modal human-robot interaction. In: Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems (2024)","DOI":"10.1145\/3613905.3651029"},{"key":"21_CR27","doi-asserted-by":"publisher","unstructured":"Wang, H., Li, J., Wu, H., Hovy, E., Sun, Y.: Pre-trained language models and their applications. Engineering 25(6), 51\u201365 (2022). https:\/\/doi.org\/10.1016\/j.eng.2022.04.024","DOI":"10.1016\/j.eng.2022.04.024"},{"key":"21_CR28","doi-asserted-by":"crossref","unstructured":"Wu, J., et\u00a0al.: Large-scale datasets for going deeper in image understanding. In: 2019 IEEE International Conference on Multimedia and Expo (ICME) (2019)","DOI":"10.1109\/ICME.2019.00256"},{"key":"21_CR29","doi-asserted-by":"publisher","unstructured":"Yamagishi, J., Veaux, C., MacDonald, K.: CSTR VCTK Corpus: English Multi-Speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92). University of Edinburgh, The Centre for Speech Technology Research (CSTR).https:\/\/doi.org\/10.7488\/ds\/2645","DOI":"10.7488\/ds\/2645"},{"key":"21_CR30","doi-asserted-by":"crossref","unstructured":"Yang, G.Z., et\u00a0al.: The grand challenges of science robotics. Sci. Rob. 3(14) (2018)","DOI":"10.1126\/scirobotics.aar7650"},{"key":"21_CR31","unstructured":"Yao, S., et al.: ReAct: synergizing reasoning and acting in language models. In: International Conference on Learning Representations (ICLR) (2023)"},{"key":"21_CR32","unstructured":"Ying, L., et al.: The neuro-symbolic inverse planning engine (NIPE): modeling probabilistic social inferences from linguistic inputs. In: ICML Workshop on Theory of Mind in Communicating Agents (2023)"},{"key":"21_CR33","doi-asserted-by":"publisher","unstructured":"You, H., Ye, Y., Zhou, T., Zhu, Q., Du, J.: Robot-enabled construction assembly with automated sequence planning based on ChatGPT: RoboGPT. Buildings 13(7) (2023). https:\/\/doi.org\/10.3390\/buildings13071772","DOI":"10.3390\/buildings13071772"},{"key":"21_CR34","doi-asserted-by":"crossref","unstructured":"Zhao, X., Li, M., Weber, C., Hafez, M.B., Wermter, S.: Chat with the environment: interactive multimodal perception using large language models. In: 2023 IEEE\/RSJ International Conference on Intelligent Robots and Systems (IROS) (2023)","DOI":"10.1109\/IROS55552.2023.10342363"}],"container-title":["Lecture Notes in Computer Science","Artificial Neural Networks and Machine Learning \u2013 ICANN 2024"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/978-3-031-72341-4_21","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,9,16]],"date-time":"2024-09-16T13:14:21Z","timestamp":1726492461000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/978-3-031-72341-4_21"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024]]},"ISBN":["9783031723407","9783031723414"],"references-count":34,"URL":"https:\/\/doi.org\/10.1007\/978-3-031-72341-4_21","relation":{},"ISSN":["0302-9743","1611-3349"],"issn-type":[{"value":"0302-9743","type":"print"},{"value":"1611-3349","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024]]},"assertion":[{"value":"17 September 2024","order":1,"name":"first_online","label":"First Online","group":{"name":"ChapterHistory","label":"Chapter History"}},{"value":"ICANN","order":1,"name":"conference_acronym","label":"Conference Acronym","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"International Conference on Artificial Neural Networks","order":2,"name":"conference_name","label":"Conference Name","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Lugano","order":3,"name":"conference_city","label":"Conference City","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Switzerland","order":4,"name":"conference_country","label":"Conference Country","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"2024","order":5,"name":"conference_year","label":"Conference Year","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"17 September 2024","order":7,"name":"conference_start_date","label":"Conference Start Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"20 September 2024","order":8,"name":"conference_end_date","label":"Conference End Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"33","order":9,"name":"conference_number","label":"Conference Number","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"icann2024","order":10,"name":"conference_id","label":"Conference ID","group":{"name":"ConferenceInfo","label":"Conference Information"}}]}}