{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,6]],"date-time":"2026-03-06T03:59:11Z","timestamp":1772769551567,"version":"3.50.1"},"reference-count":46,"publisher":"Springer Science and Business Media LLC","issue":"3","license":[{"start":{"date-parts":[[2026,3,6]],"date-time":"2026-03-06T00:00:00Z","timestamp":1772755200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2026,3,6]],"date-time":"2026-03-06T00:00:00Z","timestamp":1772755200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100003977","name":"Israel Science Foundation","doi-asserted-by":"publisher","award":["451\/24"],"award-info":[{"award-number":["451\/24"]}],"id":[{"id":"10.13039\/501100003977","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100004375","name":"Tel Aviv University","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100004375","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Intel Serv Robotics"],"published-print":{"date-parts":[[2026,5]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Grasping unknown objects in unstructured environments is a critical challenge for service robots, which must operate in dynamic, real-world settings such as homes, hospitals, and warehouses. Success in these environments requires both semantic understanding and spatial reasoning. Traditional methods often rely on dense training datasets or detailed geometric modeling, which demand extensive data collection and do not generalize well to novel objects or affordances. We present ORACLE-Grasp, a zero-shot framework that leverages large multimodal models (LMMs) as semantic oracles to guide affordance-aligned grasp selection, without requiring task-specific training or manual input. The system reformulates grasp prediction as a structured, iterative decision process, using a dual-prompt tool-calling strategy: the first prompt extracts high-level object semantics, while the second identifies graspable regions aligned with the object\u2019s function. To address the spatial limitations of LMMs, ORACLE-Grasp discretizes the image into candidate regions and reasons over them to produce human-like and context-sensitive grasp suggestions. A depth-based refinement step improves grasp reliability when available, and an early stopping mechanism enhances computational efficiency. We evaluate ORACLE-Grasp on a diverse set of RGB and RGB-D images featuring both everyday and AI-generated objects. The results show that our method produces physically feasible and semantically appropriate grasps that align closely with human annotations, achieving high success rates in real-world pick-up tasks. Our findings highlight the potential of LMMs for enabling flexible and generalizable grasping strategies in autonomous service robots, eliminating the need for object-specific models or extensive training.<\/jats:p>","DOI":"10.1007\/s11370-026-00707-4","type":"journal-article","created":{"date-parts":[[2026,3,6]],"date-time":"2026-03-06T02:52:31Z","timestamp":1772765551000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Oracle-grasp: zero-shot affordance-aligned robotic grasping using large multimodal models"],"prefix":"10.1007","volume":"19","author":[{"given":"Avihai","family":"Giuili","sequence":"first","affiliation":[]},{"given":"Rotem","family":"Atari","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3320-3897","authenticated-orcid":false,"given":"Avishai","family":"Sintov","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2026,3,6]]},"reference":[{"key":"707_CR1","doi-asserted-by":"crossref","unstructured":"Eppner C, Brock O (2015) Planning grasp strategies that exploit environmental constraints. In: IEEE international conference on robotics and automation (ICRA), pp 4947\u20134952","DOI":"10.1109\/ICRA.2015.7139886"},{"issue":"4\u20135","key":"707_CR2","doi-asserted-by":"publisher","first-page":"421","DOI":"10.1177\/0278364917710318","volume":"37","author":"S Levine","year":"2018","unstructured":"Levine S, Pastor P, Krizhevsky A, Ibarz J, Quillen D (2018) Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. Int J Robot Res 37(4\u20135):421\u2013436","journal-title":"Int J Robot Res"},{"key":"707_CR3","doi-asserted-by":"crossref","unstructured":"Zeng A, Song S, Welker S, Lee J, Rodriguez A, Funkhouser T (2018) Learning synergies between pushing and grasping with self-supervised deep reinforcement learning. In: IEEE international conference on intelligent robots and systems (IROS)","DOI":"10.1109\/IROS.2018.8593986"},{"issue":"26","key":"707_CR4","doi-asserted-by":"publisher","first-page":"4984","DOI":"10.1126\/scirobotics.aau4984","volume":"4","author":"J Mahler","year":"2019","unstructured":"Mahler J, Matl M, Satish V, Danielczuk M, DeRose B, McKinley S, Goldberg K (2019) Learning ambidextrous robot grasping policies. Sci Robot 4(26):4984","journal-title":"Sci Robot"},{"key":"707_CR5","doi-asserted-by":"publisher","first-page":"289","DOI":"10.1109\/TRO.2013.2289018","volume":"30","author":"J Bohg","year":"2014","unstructured":"Bohg J, Morales A, Asfour T, Kragic D (2014) Data-driven grasp synthesis-a survey. IEEE Trans Rob 30:289\u2013309","journal-title":"IEEE Trans Rob"},{"key":"707_CR6","doi-asserted-by":"publisher","DOI":"10.1016\/j.rcim.2023.102644","volume":"86","author":"Y Huang","year":"2024","unstructured":"Huang Y, Liu D, Liu Z, Wang K, Wang Q, Tan J (2024) A novel robotic grasping method for moving objects based on multi-agent deep reinforcement learning. Robot Comput Integr Manuf 86:102644","journal-title":"Robot Comput Integr Manuf"},{"issue":"3","key":"707_CR7","doi-asserted-by":"publisher","first-page":"326","DOI":"10.1016\/j.robot.2011.07.016","volume":"60","author":"A Sahbani","year":"2012","unstructured":"Sahbani A, El-Khoury S, Bidaud P (2012) An overview of 3d object grasp synthesis algorithms. Robot Auton Syst 60(3):326\u2013336 (Autonomous Grasping)","journal-title":"Robot Auton Syst"},{"key":"707_CR8","doi-asserted-by":"crossref","unstructured":"Morrison D, Corke P, Leitner J (2018) Closing the loop for robotic grasping: a real-time, generative grasp synthesis approach. In: Robotics: science and systems","DOI":"10.15607\/RSS.2018.XIV.021"},{"issue":"15","key":"707_CR9","doi-asserted-by":"publisher","first-page":"4861","DOI":"10.3390\/s24154861","volume":"24","author":"KS Khor","year":"2024","unstructured":"Khor KS, Liu C, Cheah CC (2024) Robotic grasping of unknown objects based on deep learning-based feature detection. Sensors 24(15):4861","journal-title":"Sensors"},{"key":"707_CR10","doi-asserted-by":"publisher","first-page":"157","DOI":"10.1177\/0278364907087172","volume":"27","author":"A Saxena","year":"2008","unstructured":"Saxena A, Driemeyer J, Ng AY (2008) Robotic grasping of novel objects using vision. Int J Robot Res 27:157\u2013173","journal-title":"Int J Robot Res"},{"key":"707_CR11","doi-asserted-by":"crossref","unstructured":"Mahler J, Liang J, Niyaz S, Laskey M, Doan R, Liu X, Aparicio J, Goldberg K (2017) Dex-net 2.0: deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. In: Robotics: science and systems","DOI":"10.15607\/RSS.2017.XIII.058"},{"issue":"2","key":"707_CR12","doi-asserted-by":"publisher","first-page":"2286","DOI":"10.1109\/LRA.2020.2969946","volume":"5","author":"L Shao","year":"2020","unstructured":"Shao L, Ferreira F, Jorda M, Nambiar V, Luo J, Solowjow E, Ojea J, Khatib O, Bohg J (2020) Unigrasp: learning a unified model to grasp with multifingered robotic hands. IEEE Robot Autom Lett 5(2):2286\u20132293","journal-title":"IEEE Robot Autom Lett"},{"issue":"7","key":"707_CR13","doi-asserted-by":"publisher","first-page":"690","DOI":"10.1177\/0278364919868017","volume":"41","author":"A Zeng","year":"2022","unstructured":"Zeng A, Song S, Yu KT, Donlon E, Hogan FR, Bauza M et al (2022) Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching. Inter J Robot Res 41(7):690\u2013705","journal-title":"Inter J Robot Res"},{"key":"707_CR14","doi-asserted-by":"crossref","unstructured":"Dang H, Allen PK (2012) Semantic grasping: Planning robotic grasps functionally suitable for an object manipulation task. In: IEEE\/RSJ international conference on intelligent robots and systems, pp 1311\u20131317","DOI":"10.1109\/IROS.2012.6385563"},{"key":"707_CR15","unstructured":"Minaee S, Mikolov T, Nikzad N, Chenaghlu M, Socher R, Amatriain X, Gao J (2024) Large language models: a survey. Preprint at arXiv:2402.06196"},{"key":"707_CR16","unstructured":"Wang J, Jiang H, Liu Y, Ma C, Zhang X, Pan Y, Liu M, Gu P, Xia S, Li W et al (2024) A comprehensive review of multimodal large language models: performance and challenges across different tasks. Preprint at arXiv:2408.01319"},{"key":"707_CR17","doi-asserted-by":"crossref","unstructured":"Wang J, Shi E, Hu H, Ma C, Liu Y, Wang X, Yao Y, Liu X, Ge B, Zhang S (2025) Large language models for robotics: opportunities, challenges, and perspectives. J Autom Intell 4(1):52\u201364","DOI":"10.1016\/j.jai.2024.12.003"},{"key":"707_CR18","doi-asserted-by":"crossref","unstructured":"Guo D, Xiang Y, Zhao S, Zhu X, Tomizuka M, Ding M, Zhan W (2024) PhyGrasp: generalizing robotic grasping with physics-informed large multimodal models. Preprint at arXiv:2402.16836","DOI":"10.1109\/IROS60139.2025.11246481"},{"key":"707_CR19","unstructured":"Tziafas G, Yucheng X, Goel A, Kasaei M, Li Z, Kasaei H (2023) Language-guided robot grasping: clip-based referring grasp synthesis in clutter. In: 7th Annual conference on robot learning"},{"key":"707_CR20","unstructured":"Jin S, XU J, Lei Y, Zhang L (2024) Reasoning grasping via multimodal large language model. In: Conference on robot learning"},{"key":"707_CR21","doi-asserted-by":"crossref","unstructured":"Xu J, Jin S, Lei Y, Zhang Y, Zhang L (2024) RT-Grasp: Reasoning tuning robotic grasping via multi-modal large language model. In: IEEE\/RSJ international conference on intelligent robots and systems, pp. 7323\u20137330","DOI":"10.1109\/IROS58592.2024.10801718"},{"key":"707_CR22","unstructured":"Li H, Mao W, Deng W, Meng C, Fan H, Wang T, Tan P, Wang H, Deng X (2025) Multi-GraspLLM: A multimodal LLM for multi-hand semantic guided grasp generation. Preprint at arXiv:2412.08468"},{"key":"707_CR23","doi-asserted-by":"crossref","unstructured":"Li S, Bhagat S, Campbell J, Xie Y, Kim W, Sycara K, Stepputtis S (2024) Shapegrasp: Zero-shot task-oriented grasping with large language models through geometric decomposition. In: IEEE\/RSJ international conference on intelligent robots and systems (IROS), pp 10527\u201310534","DOI":"10.1109\/IROS58592.2024.10801661"},{"issue":"11","key":"707_CR24","doi-asserted-by":"publisher","first-page":"7551","DOI":"10.1109\/LRA.2023.3320012","volume":"8","author":"C Tang","year":"2023","unstructured":"Tang C, Huang D, Ge W, Liu W, Zhang H (2023) Graspgpt: Leveraging semantic knowledge from a large language model for task-oriented grasping. IEEE Robot Autom Lett 8(11):7551\u20137558","journal-title":"IEEE Robot Autom Lett"},{"key":"707_CR25","doi-asserted-by":"publisher","first-page":"12418","DOI":"10.1109\/TASE.2025.3542418","volume":"22","author":"C Tang","year":"2025","unstructured":"Tang C, Huang D, Dong W, Xu R, Zhang H (2025) Foundationgrasp: Generalizable task-oriented grasping with foundation models. IEEE Trans Autom Sci Eng 22:12418\u201312435","journal-title":"IEEE Trans Autom Sci Eng"},{"key":"707_CR26","doi-asserted-by":"crossref","unstructured":"Minderer M, Vo M, Zhai X, Alayrac JB, Lorenz D, Pinto L, Zoph B, Barham P, Dinculescu M, Houlsby N et al (2022) Simple open-vocabulary object detection with vision transformers. European conference on computer vision","DOI":"10.1007\/978-3-031-20080-9_42"},{"key":"707_CR27","unstructured":"Shridhar M, Weng T, Jain A, Xu D, Wang Y et al (2023) Perceiver-actor: a multimodal transformer for robotic manipulation. In: Robotics: science and systems (RSS)"},{"key":"707_CR28","unstructured":"Wei J, Wang X, Schuurmans D, Bosma M, Ichter B, Xia F, Chi E H, Le QV, Zhou D (2022) Chain-of-thought prompting elicits reasoning in large language models. In: International conference on neural information processing systems"},{"key":"707_CR29","unstructured":"Ahn M, Brohan A, Tobin J, Abbeel P, Sermanet P (2022) Do as i can, not as i say: Grounding language in robotic affordances. In: Robotics: science and systems (RSS)"},{"key":"707_CR30","unstructured":"Shridhar M, Manuelli L, Fox D (2023) CLIPort: What and where pathways for robotic manipulation. In: Conference on robot learning"},{"key":"707_CR31","first-page":"2165","volume":"229","author":"B Zitkovich","year":"2023","unstructured":"Zitkovich B, Yu T, Xu S, Xu P et al (2023) Rt-2: Vision-language-action models transfer web knowledge to robotic control. Conf Robot Learn 229:2165\u20132183","journal-title":"Conf Robot Learn"},{"key":"707_CR32","unstructured":"Driess D, Shridhar M, Ebert F et al (2023) PaLM-E: An embodied multimodal language model. Preprint at arXiv:2303.03378"},{"key":"707_CR33","unstructured":"OpenAI (2024) Gpt-4o system card. Preprint at arXiv:2410.21276"},{"key":"707_CR34","doi-asserted-by":"crossref","unstructured":"Alayrac JB, Donahue J, Luc P, Miech A, Barr I, Hasson Y et al (2022) Flamingo: a visual language model for few-shot learning. In: International conference on neural information processing systems","DOI":"10.52202\/068431-1723"},{"key":"707_CR35","unstructured":"Liu H, Li C, Wu Q, Lee YJ (2023) Visual instruction tuning. In: NeurIPS"},{"key":"707_CR36","doi-asserted-by":"crossref","unstructured":"Zhang D, Yu Y, Dong J, Li C, Su D, Chu C, Yu D (2024) MM-LLMs: Recent advances in multimodal large language models. In: Findings of the association for computational linguistics, pp 12401\u201312430","DOI":"10.18653\/v1\/2024.findings-acl.738"},{"key":"707_CR37","doi-asserted-by":"crossref","unstructured":"Long Z, Killick G, McCreadie R, Aragon-Camarasa G (2024) RoboLLM: Robotic vision tasks grounded on multimodal large language models. In: IEEE international conference on robotics and automation","DOI":"10.1109\/ICRA57147.2024.10610797"},{"issue":"4","key":"707_CR38","doi-asserted-by":"publisher","first-page":"592","DOI":"10.1038\/s42256-025-01005-x","volume":"7","author":"R Mon-Williams","year":"2025","unstructured":"Mon-Williams R, Li G, Long R, Du W, Lucas C (2025) Embodied large language models enable robots to complete complex tasks in unpredictable environments. Nat Mach Intel 7(4):592\u2013601","journal-title":"Nat Mach Intel"},{"key":"707_CR39","doi-asserted-by":"crossref","unstructured":"Singh R, Xu D et al (2023) Progprompt: Generating situated robot action plans with large language models. In: Conference on robot learning (CoRL)","DOI":"10.1109\/ICRA48891.2023.10161317"},{"key":"707_CR40","doi-asserted-by":"crossref","unstructured":"Lykov A, Litvinov M, Konenkov M, Prochii R, Burtsev N, Abdulkarim AA, Bazhenov A, Berman V, Tsetserukou D (2024) Cognitivedog: large multimodal model based system to translate vision and language into action of quadruped robot. In: ACM\/IEEE international conference on human-robot interaction, pp 712\u2013716","DOI":"10.1145\/3610978.3641080"},{"key":"707_CR41","doi-asserted-by":"crossref","unstructured":"Tellex S, Thaker P, Joseph T, Kollar T, Roy N (2011) Understanding natural language commands for robotic navigation and mobile manipulation. In: AAAI conference on artificial intelligence, vol. 25, pp 1507\u20131514","DOI":"10.1609\/aaai.v25i1.7979"},{"key":"707_CR42","unstructured":"Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G, Sutskever I (2021) Learning transferable visual models from natural language supervision. In: International conference on machine learning"},{"key":"707_CR43","doi-asserted-by":"crossref","unstructured":"Tang C, Huang D, Meng L, Liu W, Zhang H (2023) Task-oriented grasp prediction with visual-language inputs. In: IEEE\/RSJ international conference on intelligent robots and systems (IROS), pp 4881\u20134888","DOI":"10.1109\/IROS55552.2023.10342268"},{"key":"707_CR44","doi-asserted-by":"crossref","unstructured":"Vuong AD, Vu MN, Huang B, Nguyen N, Le H, Vo T, Nguyen A (2024) Language-driven grasp detection. In: IEEE\/CVF conference on computer vision and pattern recognition (CVPR), pp. 17902\u201317912","DOI":"10.1109\/CVPR52733.2024.01695"},{"key":"707_CR45","unstructured":"Mirjalili R, Krawez M, Silenzi S, Blei Y, Burgard W (2024) LAN-grasp: Using large language models for semantic object grasping. Preprint at arXiv:2310.05239"},{"key":"707_CR46","unstructured":"Grattafiori A, Dubey A, Jauhri A, Pandey A, Kadian A et al (2024) The Llama 3 herd of models. Preprint at arXiv:2407.21783"}],"container-title":["Intelligent Service Robotics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11370-026-00707-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11370-026-00707-4","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11370-026-00707-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,6]],"date-time":"2026-03-06T02:52:40Z","timestamp":1772765560000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11370-026-00707-4"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,3,6]]},"references-count":46,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2026,5]]}},"alternative-id":["707"],"URL":"https:\/\/doi.org\/10.1007\/s11370-026-00707-4","relation":{},"ISSN":["1861-2776","1861-2784"],"issn-type":[{"value":"1861-2776","type":"print"},{"value":"1861-2784","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,3,6]]},"assertion":[{"value":"21 July 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 February 2026","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"6 March 2026","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that there is no Conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}],"article-number":"46"}}