{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,3]],"date-time":"2026-06-03T16:00:06Z","timestamp":1780502406679,"version":"3.54.1"},"reference-count":84,"publisher":"SAGE Publications","issue":"4","license":[{"start":{"date-parts":[[2025,1,22]],"date-time":"2025-01-22T00:00:00Z","timestamp":1737504000000},"content-version":"vor","delay-in-days":366,"URL":"http:\/\/www.sagepub.com\/licence-information-for-chorus"}],"funder":[{"name":"Samsung GRO"},{"name":"ONR","award":["N00014-22-1-2096"],"award-info":[{"award-number":["N00014-22-1-2096"]}]},{"name":"NSF","award":["IIS-2024594"],"award-info":[{"award-number":["IIS-2024594"]}]},{"name":"NSF GRFP","award":["DGE2140739"],"award-info":[{"award-number":["DGE2140739"]}]},{"name":"GoodAI Research Award"}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["The International Journal of Robotics Research"],"published-print":{"date-parts":[[2024,4]]},"abstract":"<jats:p>To build general robotic agents that can operate in many environments, it is often useful for robots to collect experience in the real world. However, unguided experience collection is often not feasible due to safety, time, and hardware restrictions. We thus propose leveraging the next best thing as real world experience: videos of humans using their hands. To utilize these videos, we develop a method that retargets any 1st person or 3rd person video of human hands and arms into the robot hand and arm trajectories. While retargeting is a difficult problem, our key insight is to rely on only internet human hand video to train it. We use this method to present results in two areas: First, we build a system that enables any human to control a robot hand and arm, simply by demonstrating motions with their own hand. The robot observes the human operator via a single RGB camera and imitates their actions in real-time. This enables the robot to collect real-world experience safely using supervision. See these results at https:\/\/robotic-telekinesis.github.io . Second, we retarget in-the-wild human internet video into task-conditioned pseudo-robot trajectories to use as artificial robot experience. This learning algorithm leverages action priors from human hand actions, visual features from the images, and physical priors from dynamical systems to pretrain typical human behavior for a particular robot task. We show that by leveraging internet human hand experience, we need fewer robot demonstrations compared to many other methods. See these results at https:\/\/video-dex.github.io<\/jats:p>","DOI":"10.1177\/02783649241227559","type":"journal-article","created":{"date-parts":[[2024,1,23]],"date-time":"2024-01-23T02:49:00Z","timestamp":1705978140000},"page":"513-532","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":13,"title":["Learning dexterity from human hand motion in internet videos"],"prefix":"10.1177","volume":"43","author":[{"ORCID":"https:\/\/orcid.org\/0009-0002-8571-2922","authenticated-orcid":false,"given":"Kenneth","family":"Shaw","sequence":"first","affiliation":[{"name":"Carnegie Mellon University, Pittsburgh, PA, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Shikhar","family":"Bahl","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University, Pittsburgh, PA, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Aravind","family":"Sivakumar","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University, Pittsburgh, PA, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3104-7560","authenticated-orcid":false,"given":"Aditya","family":"Kannan","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University, Pittsburgh, PA, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Deepak","family":"Pathak","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University, Pittsburgh, PA, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"179","published-online":{"date-parts":[[2024,1,22]]},"reference":[{"key":"bibr1-02783649241227559","first-page":"3453","volume-title":"Conference on robot learning","author":"Agarwal A","year":"2023"},{"key":"bibr2-02783649241227559","doi-asserted-by":"crossref","unstructured":"Antotsiou D, Garcia-Hernando G, Kim TK (2018) Task-oriented hand motion retargeting for dexterous manipulation imitation. In: Proceedings of the European conference on computer vision (ECCV) workshops. Munich, Germany, 8 September\u201314 September 2018.","DOI":"10.1007\/978-3-030-11024-6_19"},{"key":"bibr3-02783649241227559","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2203.13251"},{"key":"bibr4-02783649241227559","volume-title":"NeurIPS","author":"Bahl S","year":"2020"},{"key":"bibr5-02783649241227559","doi-asserted-by":"publisher","DOI":"10.15607\/RSS.2021.XVII.023"},{"key":"bibr6-02783649241227559","doi-asserted-by":"publisher","DOI":"10.15607\/RSS.2022.XVIII.026"},{"key":"bibr7-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01324"},{"key":"bibr8-02783649241227559","unstructured":"Bhat SF, Alhashim I, Wonka P (2021) Adabins: depth estimation using adaptive bins. In: Proceedings of the IEEE\/CVF Conference on computer vision and pattern recognition, Nashville, TN, USA, 20 June\u201325 June 2021, 4009\u20134018."},{"key":"bibr9-02783649241227559","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.1604.07316"},{"key":"bibr10-02783649241227559","volume-title":"JAX: Composable Transformations of Python+NumPy Programs","author":"Bradbury J","year":"2018"},{"key":"bibr11-02783649241227559","volume-title":"Language Models Are Few-Shot Learners","author":"Brown TB","year":"2020"},{"key":"bibr12-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1080\/2151237X.2005.10129202"},{"key":"bibr13-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1080\/2151237X.2005.10129202"},{"key":"bibr14-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1109\/TRO.2021.3075644"},{"key":"bibr15-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2019.2929257"},{"key":"bibr16-02783649241227559","volume-title":"IEEE International Symposium on System Integrations (SII)","author":"Carpentier J","year":"2019"},{"key":"bibr17-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00893"},{"key":"bibr18-02783649241227559","first-page":"1597","volume-title":"Proceedings of the 37th International conference on machine learning, proceedings of machine learning research","volume":"119","author":"Chen T","year":"2020"},{"key":"bibr19-02783649241227559","doi-asserted-by":"publisher","DOI":"10.15607\/RSS.2021.XVII.012"},{"key":"bibr20-02783649241227559","volume-title":"European conference on computer vision (ECCV)","author":"Damen D","year":"2018"},{"key":"bibr21-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2013.340"},{"key":"bibr22-02783649241227559","volume-title":"Model-based Inverse Reinforcement Learning from Visual Demonstrations","author":"Das N","year":"2020"},{"key":"bibr23-02783649241227559","volume-title":"NeurIPS datasets and benchmarks track (Round 2)","author":"Dasari S","year":"2021"},{"key":"bibr24-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"bibr25-02783649241227559","volume-title":"Bert: Pre-training of Deep Bidirectional Transformers for Language Understanding","author":"Devlin J","year":"2018"},{"key":"bibr26-02783649241227559","doi-asserted-by":"publisher","DOI":"10.26599\/TST.2018.9010096"},{"key":"bibr27-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1145\/358669.358692"},{"key":"bibr28-02783649241227559","volume-title":"Proceedings of the IEEE International conference on computer vision (ICCV)","author":"Goyal R","year":"2017"},{"key":"bibr29-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01842"},{"key":"bibr30-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1145\/3197517.3201399"},{"key":"bibr31-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA40945.2020.9197124"},{"key":"bibr32-02783649241227559","volume-title":"Deep Residual Learning for Image Recognition","author":"He K","year":"2015"},{"key":"bibr33-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.322"},{"key":"bibr34-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01553"},{"key":"bibr35-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298698"},{"key":"bibr36-02783649241227559","volume-title":"Cmu Graphics Lab Motion Capture Database","author":"Hodgins J"},{"key":"bibr37-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2013.248"},{"key":"bibr38-02783649241227559","volume-title":"Qt-opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation","author":"Kalashnikov D","year":"2018"},{"key":"bibr39-02783649241227559","volume-title":"End-to-End Recovery of Human Shape and Pose","author":"Kanazawa A","year":"2017"},{"key":"bibr40-02783649241227559","volume-title":"Deft: Dexterous Fine-Tuning for Real-World Hand Policies","author":"Kannan A","year":"2023"},{"key":"bibr41-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1109\/HUMANOIDS.2015.7363441"},{"key":"bibr42-02783649241227559","first-page":"1179","volume":"33","author":"Kumar A","year":"2020","journal-title":"Advances in Neural Information Processing Systems"},{"key":"bibr43-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW.2017.63"},{"key":"bibr44-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1109\/TMECH.2016.2634602"},{"key":"bibr45-02783649241227559","volume-title":"End-to-end Training of Deep Visuomotor Policies","author":"Levine S","year":"2016"},{"key":"bibr46-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA.2019.8794277"},{"key":"bibr47-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1145\/2816795.2818013"},{"key":"bibr48-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1109\/ICAR.2011.6088576"},{"key":"bibr49-02783649241227559","volume-title":"Isaac Gym: High Performance Gpu-Based Physics Simulation for Robot Learning","author":"Makoviychuk V","year":"2021"},{"key":"bibr50-02783649241227559","first-page":"651","volume-title":"Conference on Robot Learning","author":"Mandikal P","year":"2022"},{"key":"bibr51-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1109\/Humanoids57100.2023.10375195"},{"key":"bibr52-02783649241227559","doi-asserted-by":"publisher","DOI":"10.15607\/RSS.2023.XIX.012"},{"key":"bibr53-02783649241227559","first-page":"9191","volume-title":"NeurIPS","author":"Nair AV","year":"2018"},{"key":"bibr54-02783649241227559","volume-title":"R3m: A Universal Visual Representation for Robot Manipulation","author":"Nair S","year":"2022"},{"key":"bibr55-02783649241227559","volume-title":"The Surprising Effectiveness of Representation Learning for Visual Imitation","author":"Pari J","year":"2021"},{"key":"bibr56-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01123"},{"key":"bibr57-02783649241227559","volume-title":"Learning Agile Robotic Locomotion Skills by Imitating Animals","author":"Peng XB","year":"2020"},{"key":"bibr58-02783649241227559","volume-title":"The Curious Robot: Learning Visual Representations via Physical Interactions","author":"Pinto L","year":"2016"},{"key":"bibr59-02783649241227559","volume-title":"Advances in neural information processing systems","volume":"1","author":"Pomerleau DA","year":"1988"},{"key":"bibr60-02783649241227559","volume-title":"Dexmv: Imitation Learning for Dexterous Manipulation from Human Videos","author":"Qin Y","year":"2021"},{"key":"bibr61-02783649241227559","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2204.12490"},{"key":"bibr62-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1145\/3130800.3130883"},{"key":"bibr63-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW54120.2021.00201"},{"key":"bibr64-02783649241227559","volume-title":"Reinforcement Learning with Videos: Combining Offline Observations with Interaction","author":"Schmeckpeper K","year":"2020"},{"key":"bibr65-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46487-9_31"},{"key":"bibr66-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA.2018.8462891"},{"key":"bibr67-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00989"},{"key":"bibr68-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1177\/02783649211046285"},{"key":"bibr69-02783649241227559","volume-title":"Third-Person Visual Imitation Learning via Decoupled Hierarchical Controller","author":"Sharma P","year":"2019"},{"key":"bibr70-02783649241227559","author":"Shaw K","year":"2023","journal-title":"RSS"},{"key":"bibr71-02783649241227559","first-page":"654","volume-title":"Conference on robot learning","author":"Shaw K","year":"2023"},{"key":"bibr72-02783649241227559","volume-title":"Very Deep Convolutional Networks for Large-Scale Image Recognition","author":"Simonyan K","year":"2014"},{"key":"bibr73-02783649241227559","volume-title":"Robotic Telekinesis: Learning a Robotic Hand Imitator by Watching Humans on Youtube","author":"Sivakumar A","year":"2022"},{"key":"bibr74-02783649241227559","volume-title":"Avid: Learning Multi-Stage Tasks via Pixel-Level Translation of Human Videos","author":"Smith L","year":"2020"},{"key":"bibr75-02783649241227559","volume-title":"MuJoCo: A Physics Engine for Model-Based Control","author":"Todorov E","year":"2012"},{"key":"bibr76-02783649241227559","unstructured":"UFactory (n.d) xarm6 by ufactory. https:\/\/www.ufactory.cc\/xarm-collaborative-robot"},{"key":"bibr77-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1109\/34.88573"},{"key":"bibr78-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00901"},{"key":"bibr79-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1145\/3414685.3417788"},{"key":"bibr80-02783649241227559","volume-title":"Masked Visual Pre-training for Motor Control","author":"Xiao T","year":"2022"},{"key":"bibr81-02783649241227559","volume-title":"Visual Imitation Made Easy","author":"Young S","year":"2020"},{"key":"bibr82-02783649241227559","volume-title":"Xirl: Cross-Embodiment Inverse Reinforcement Learning","author":"Zakka K","year":"2021"},{"key":"bibr83-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-20077-9_21"},{"key":"bibr84-02783649241227559","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00090"}],"container-title":["The International Journal of Robotics Research"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/02783649241227559","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/02783649241227559","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/02783649241227559","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/02783649241227559","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T10:17:14Z","timestamp":1777457834000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/02783649241227559"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,1,22]]},"references-count":84,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,4]]}},"alternative-id":["10.1177\/02783649241227559"],"URL":"https:\/\/doi.org\/10.1177\/02783649241227559","relation":{},"ISSN":["0278-3649","1741-3176"],"issn-type":[{"value":"0278-3649","type":"print"},{"value":"1741-3176","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,1,22]]}}}