{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T05:13:12Z","timestamp":1755839592822,"version":"3.37.3"},"reference-count":45,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2023,8,8]],"date-time":"2023-08-08T00:00:00Z","timestamp":1691452800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,8,8]],"date-time":"2023-08-08T00:00:00Z","timestamp":1691452800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Complex Intell. Syst."],"published-print":{"date-parts":[[2024,2]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Learning visual predictive models has great potential for real-world robot manipulations. Visual predictive models serve as a model of real-world dynamics to comprehend the interactions between the robot and objects. However, prior works in the literature have focused mainly on low-level elementary robot actions, which typically result in lengthy, inefficient, and highly complex robot manipulation. In contrast, humans usually employ top\u2013down thinking of high-level actions rather than bottom\u2013up stacking of low-level ones. To address this limitation, we present a novel formulation for robot manipulation that can be accomplished by pick-and-place, a commonly applied high-level robot action, through grasping. We propose a novel visual predictive model that combines an action decomposer and a video prediction network to learn the intrinsic semantic information of high-level actions. 
Experiments show that our model can accurately predict the object dynamics (i.e., the object movements under robot manipulation) while trained directly on observations of high-level pick-and-place actions. We also demonstrate that, together with a sampling-based planner, our model achieves a higher success rate using high-level actions on a variety of real robot manipulation tasks.<\/jats:p>","DOI":"10.1007\/s40747-023-01174-5","type":"journal-article","created":{"date-parts":[[2023,8,8]],"date-time":"2023-08-08T06:01:46Z","timestamp":1691474506000},"page":"811-823","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Learning high-level robotic manipulation actions with visual predictive model"],"prefix":"10.1007","volume":"10","author":[{"given":"Anji","family":"Ma","sequence":"first","affiliation":[]},{"given":"Guoyi","family":"Chi","sequence":"additional","affiliation":[]},{"given":"Serena","family":"Ivaldi","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8508-6699","authenticated-orcid":false,"given":"Lipeng","family":"Chen","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2023,8,8]]},"reference":[{"key":"1174_CR1","unstructured":"Babaeizadeh M, Finn C, Erhan D, Campbell R, Levine S (2018) Stochastic variational video prediction. In: 6th international conference on learning representations, ICLR 2018"},{"key":"1174_CR2","doi-asserted-by":"crossref","unstructured":"Chen B, Wang W, Wang J (2017) Video imagination from a single image with transformation generation. In: Proceedings of the on thematic workshops of ACM multimedia 2017, pp 358\u2013366","DOI":"10.1145\/3126686.3126737"},{"key":"1174_CR3","unstructured":"Dasari S, Ebert F, Tian S, Nair S, Bucher B, Schmeckpeper K, Singh S, Levine S, Finn C (2019) Robonet: large-scale multi-robot learning. 
In: CoRL"},{"key":"1174_CR4","unstructured":"Dasari S, Ebert F, Tian S, Nair S, Bucher B, Schmeckpeper K, Singh S, Levine S, Finn C (2019) Robonet: large-scale multi-robot learning. CoRR arXiv:1910.11215. http:\/\/arxiv.org\/abs\/1910.11215"},{"key":"1174_CR5","doi-asserted-by":"crossref","unstructured":"Deisenroth MP, Englert P, Peters J, Fox D (2014) Multi-task policy search for robotics. In: 2014 IEEE international conference on robotics and automation (ICRA). IEEE, pp 3876\u20133881","DOI":"10.1109\/ICRA.2014.6907421"},{"key":"1174_CR6","doi-asserted-by":"crossref","unstructured":"Deisenroth MP, Neumann G, Peters J et\u00a0al (2013) A survey on policy search for robotics. Found Trends\u00ae Robot 2(1\u20132):1\u2013142","DOI":"10.1561\/2300000021"},{"key":"1174_CR7","unstructured":"Denton E, Fergus R (2018) Stochastic video generation with a learned prior. In: International conference on machine learning. PMLR, pp 1174\u20131183"},{"issue":"4","key":"1174_CR8","doi-asserted-by":"publisher","first-page":"3021","DOI":"10.1007\/s40747-021-00319-8","volume":"8","author":"R Divya","year":"2022","unstructured":"Divya R, Peter JD (2022) Smart healthcare system-a brain-like computing approach for analyzing the performance of detectron2 and posenet models for anomalous action detection in aged people with movement impairments. Complex Intell Syst 8(4):3021\u20133040","journal-title":"Complex Intell Syst"},{"key":"1174_CR9","unstructured":"Ebert F, Finn C, Dasari S, Xie A, Lee A, Levine S (2018) Visual foresight: model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568"},{"key":"1174_CR10","unstructured":"Ebert F, Finn C, Lee AX, Levine S (2017) Self-supervised visual planning with temporal skip connections. In: CoRL, pp 344\u2013356"},{"key":"1174_CR11","unstructured":"Finn C, Goodfellow I, Levine S (2016) Unsupervised learning for physical interaction through video prediction. 
In: Advances in neural information processing systems, vol 29"},{"key":"1174_CR12","doi-asserted-by":"crossref","unstructured":"Finn C, Levine S (2017) Deep visual foresight for planning robot motion. In: 2017 IEEE international conference on robotics and automation (ICRA). IEEE, pp 2786\u20132793","DOI":"10.1109\/ICRA.2017.7989324"},{"issue":"6","key":"1174_CR13","doi-asserted-by":"publisher","first-page":"1077","DOI":"10.1109\/TRO.2005.852260","volume":"21","author":"E Frazzoli","year":"2005","unstructured":"Frazzoli E, Dahleh MA, Feron E (2005) Maneuver-based motion planning for nonlinear systems with symmetries. IEEE Trans Robot 21(6):1077\u20131091","journal-title":"IEEE Trans Robot"},{"key":"1174_CR14","doi-asserted-by":"crossref","unstructured":"Fu J, Levine S, Abbeel P (2016) One-shot learning of manipulation skills with online dynamics adaptation and neural network priors. In: 2016 IEEE\/RSJ international conference on intelligent robots and systems (IROS). IEEE, pp 4019\u20134026","DOI":"10.1109\/IROS.2016.7759592"},{"key":"1174_CR15","unstructured":"Gal Y, McAllister R, Rasmussen CE (2016) Improving pilco with Bayesian neural network dynamics models. In: Data-efficient machine learning workshop, vol\u00a04. ICML, p\u00a025"},{"issue":"11","key":"1174_CR16","doi-asserted-by":"publisher","first-page":"1231","DOI":"10.1177\/0278364913491297","volume":"32","author":"A Geiger","year":"2013","unstructured":"Geiger A, Lenz P, Stiller C, Urtasun R (2013) Vision meets robotics: the Kitti dataset. Int J Robot Res 32(11):1231\u20131237","journal-title":"Int J Robot Res"},{"key":"1174_CR17","doi-asserted-by":"crossref","unstructured":"Gu S, Holly E, Lillicrap T, Levine S (2017) Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In: 2017 IEEE international conference on robotics and automation (ICRA). 
IEEE, pp 3389\u20133396","DOI":"10.1109\/ICRA.2017.7989385"},{"key":"1174_CR18","unstructured":"Gualtieri M, Platt R (2018) Learning 6-DoF grasping and pick-place using attention focus. In: Conference on robot learning. PMLR, pp 477\u2013486"},{"key":"1174_CR19","unstructured":"Gupta A, Tian S, Zhang Y, Wu J, Mart\u00edn-Mart\u00edn R, Fei-Fei L (2022) Maskvit: masked visual pre-training for video prediction. arXiv preprint arXiv:2206.11894"},{"key":"1174_CR20","unstructured":"Hafner D, Lillicrap T, Fischer I, Villegas R, Ha D, Lee H, Davidson J (2019) Learning latent dynamics for planning from pixels. In: International conference on machine learning. PMLR, pp 2555\u20132565"},{"key":"1174_CR21","unstructured":"Ho J, Salimans T, Gritsenko A, Chan W, Norouzi M, Fleet DJ (2022) Video diffusion models. arXiv preprint arXiv:2204.03458"},{"issue":"13","key":"1174_CR22","doi-asserted-by":"publisher","first-page":"800","DOI":"10.1049\/el:20080522","volume":"44","author":"Q Huynh-Thu","year":"2008","unstructured":"Huynh-Thu Q, Ghanbari M (2008) Scope of validity of PSNR in image\/video quality assessment. Electron Lett 44(13):800\u2013801","journal-title":"Electron Lett"},{"issue":"7","key":"1174_CR23","doi-asserted-by":"publisher","first-page":"1325","DOI":"10.1109\/TPAMI.2013.248","volume":"36","author":"C Ionescu","year":"2013","unstructured":"Ionescu C, Papava D, Olaru V, Sminchisescu C (2013) Human3.6m: large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Trans Pattern Anal Mach Intell 36(7):1325\u20131339","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"1174_CR24","unstructured":"Kingma DP, Welling M (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114"},{"key":"1174_CR25","unstructured":"Lee AX, Zhang R, Ebert F, Abbeel P, Finn C, Levine S (2018) Stochastic adversarial video prediction. 
arXiv preprint arXiv:1804.01523"},{"key":"1174_CR26","unstructured":"Levine S, Koltun V (2014) Learning complex neural network policies with trajectory optimization. In: International conference on machine learning. PMLR, pp 829\u2013837"},{"key":"1174_CR27","doi-asserted-by":"crossref","unstructured":"Ma A, Fleytoux Y, Mouret JB, Ivaldi S (2021) VP-GO: a \u201clight\u201d action-conditioned visual prediction model. arXiv preprint arXiv:2109.12694","DOI":"10.1109\/ICARM54641.2022.9959321"},{"key":"1174_CR28","doi-asserted-by":"crossref","unstructured":"Mahler J, Matl M, Liu X, Li A, Gealy D, Goldberg K (2018) Dex-net 3.0: computing robust vacuum suction grasp targets in point clouds using a new analytic model and deep learning. In: 2018 IEEE international conference on robotics and automation (ICRA). IEEE, pp 5620\u20135627","DOI":"10.1109\/ICRA.2018.8460887"},{"issue":"5","key":"1174_CR29","doi-asserted-by":"publisher","first-page":"3613","DOI":"10.1007\/s40747-021-00397-8","volume":"8","author":"K Pasupa","year":"2022","unstructured":"Pasupa K, Kittiworapanya P, Hongngern N, Woraratpanya K (2022) Evaluation of deep learning algorithms for semantic segmentation of car parts. Complex Intell Syst 8(5):3613\u20133625","journal-title":"Complex Intell Syst"},{"key":"1174_CR30","doi-asserted-by":"crossref","unstructured":"Seita D, Florence P, Tompson J, Coumans E, Sindhwani V, Goldberg K, Zeng A (2021) Learning to rearrange deformable cables, fabrics, and bags with goal-conditioned transporter networks. In: 2021 IEEE international conference on robotics and automation (ICRA). IEEE, pp 4568\u20134575","DOI":"10.1109\/ICRA48506.2021.9561391"},{"key":"1174_CR31","unstructured":"Sekar R, Rybkin O, Daniilidis K, Abbeel P, Hafner D, Pathak D (2020) Planning to explore via self-supervised world models. In: International conference on machine learning. 
PMLR, pp 8583\u20138592"},{"key":"1174_CR32","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-32552-1","volume-title":"Springer handbook of robotics","author":"B Siciliano","year":"2016","unstructured":"Siciliano B, Khatib O (2016) Springer handbook of robotics. Springer, Berlin"},{"key":"1174_CR33","unstructured":"Silver D, Hasselt H, Hessel M, Schaul T, Guez A, Harley T, Dulac-Arnold G, Reichert D, Rabinowitz N, Barreto A et\u00a0al (2017) The predictron: end-to-end learning and planning. In: International conference on machine learning. PMLR, pp 3191\u20133199"},{"issue":"9","key":"1174_CR34","doi-asserted-by":"publisher","first-page":"3487","DOI":"10.1109\/JSEN.2018.2888815","volume":"19","author":"L Sun","year":"2018","unstructured":"Sun L, Zhao C, Yan Z, Liu P, Duckett T, Stolkin R (2018) A novel weakly-supervised approach for RGB-D-based nuclear waste object detection. IEEE Sens J 19(9):3487\u20133500","journal-title":"IEEE Sens J"},{"key":"1174_CR35","doi-asserted-by":"crossref","unstructured":"Walker J, Gupta A, Hebert M (2015) Dense optical flow prediction from a static image. In: Proceedings of the IEEE international conference on computer vision, pp 2443\u20132451","DOI":"10.1109\/ICCV.2015.281"},{"issue":"4","key":"1174_CR36","doi-asserted-by":"publisher","first-page":"600","DOI":"10.1109\/TIP.2003.819861","volume":"13","author":"Z Wang","year":"2004","unstructured":"Wang Z, Bovik AC, Sheikh HR, Simoncelli EP (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 13(4):600\u2013612","journal-title":"IEEE Trans Image Process"},{"issue":"3","key":"1174_CR37","doi-asserted-by":"publisher","first-page":"392","DOI":"10.1075\/is.10.3.06wis","volume":"10","author":"T Wisspeintner","year":"2009","unstructured":"Wisspeintner T, Van Der Zant T, Iocchi L, Schiffer S (2009) Robocup@ home: scientific competition and benchmarking for domestic service robots. 
Interact Stud 10(3):392\u2013426","journal-title":"Interact Stud"},{"key":"1174_CR38","doi-asserted-by":"crossref","unstructured":"Wong JM, Kee V, Le T, Wagner S, Mariottini GL, Schneider A, Hamilton L, Chipalkatty R, Hebert M, Johnson DM, et\u00a0al (2017) Segicp: integrated deep semantic segmentation and pose estimation. In: 2017 IEEE\/RSJ international conference on intelligent robots and systems (IROS). IEEE, pp 5784\u20135789","DOI":"10.1109\/IROS.2017.8206470"},{"key":"1174_CR39","doi-asserted-by":"crossref","unstructured":"Wu B, Nair S, Martin-Martin R, Fei-Fei L, Finn C (2021) Greedy hierarchical variational autoencoders for large-scale video prediction. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 2318\u20132328","DOI":"10.1109\/CVPR46437.2021.00235"},{"key":"1174_CR40","doi-asserted-by":"crossref","unstructured":"Wu B, Nair S, Martin-Martin R, Fei-Fei L, Finn C (2021) Greedy hierarchical variational autoencoders for large-scale video prediction. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (CVPR), pp 2318\u20132328","DOI":"10.1109\/CVPR46437.2021.00235"},{"key":"1174_CR41","doi-asserted-by":"crossref","unstructured":"Yoon Y, DeSouza GN, Kak AC (2003) Real-time tracking and pose estimation for industrial objects using geometric features. In: 2003 IEEE international conference on robotics and automation (cat. no. 03CH37422), vol\u00a03. IEEE, pp 3473\u20133478","DOI":"10.1109\/ROBOT.2003.1242127"},{"key":"1174_CR42","unstructured":"Zeng A, Florence P, Tompson J, Welker S, Chien J, Attarian M, Armstrong T, Krasin I, Duong D, Sindhwani V et\u00a0al (2020) Transporter networks: rearranging the visual world for robotic manipulation. 
arXiv preprint arXiv:2010.14406"},{"key":"1174_CR43","unstructured":"Zeng A, Florence P, Tompson J, Welker S, Chien J, Attarian M, Armstrong T, Krasin I, Duong D, Sindhwani V et\u00a0al (2021) Transporter networks: rearranging the visual world for robotic manipulation. In: Conference on robot learning. PMLR, pp 726\u2013747"},{"key":"1174_CR44","doi-asserted-by":"crossref","unstructured":"Zeng A, Yu KT, Song S, Suo D, Walker E, Rodriguez A, Xiao J (2017) Multi-view self-supervised deep learning for 6D pose estimation in the amazon picking challenge. In: 2017 IEEE international conference on robotics and automation (ICRA). IEEE, pp 1386\u20131383","DOI":"10.1109\/ICRA.2017.7989165"},{"key":"1174_CR45","doi-asserted-by":"crossref","unstructured":"Zhang R, Isola P, Efros AA, Shechtman E, Wang O (2018) The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 586\u2013595","DOI":"10.1109\/CVPR.2018.00068"}],"container-title":["Complex & Intelligent 
Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-023-01174-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s40747-023-01174-5\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-023-01174-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,2,10]],"date-time":"2024-02-10T22:23:28Z","timestamp":1707603808000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s40747-023-01174-5"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,8,8]]},"references-count":45,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2024,2]]}},"alternative-id":["1174"],"URL":"https:\/\/doi.org\/10.1007\/s40747-023-01174-5","relation":{},"ISSN":["2199-4536","2198-6053"],"issn-type":[{"type":"print","value":"2199-4536"},{"type":"electronic","value":"2198-6053"}],"subject":[],"published":{"date-parts":[[2023,8,8]]},"assertion":[{"value":"13 October 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"3 June 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"8 August 2023","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that we have no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}