{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,11]],"date-time":"2025-09-11T19:14:58Z","timestamp":1757618098732,"version":"3.44.0"},"reference-count":35,"publisher":"Springer Science and Business Media LLC","issue":"3","license":[{"start":{"date-parts":[[2025,5,21]],"date-time":"2025-05-21T00:00:00Z","timestamp":1747785600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,5,21]],"date-time":"2025-05-21T00:00:00Z","timestamp":1747785600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100005405","name":"Ritsumeikan University","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100005405","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Artif Life Robotics"],"published-print":{"date-parts":[[2025,8]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>In this study, we propose a method for learning a latent space representing 6-DoF poses and performing 6-DoF control in the latent space using NewtonianVAE. NewtonianVAE, a type of world models based on Variational Autoencoder (VAE), can learn the dynamics of the environment as a latent space from observational data and perform proportional control based on the estimated position on the latent space. However, previous research has not demonstrated 6-DoF pose estimation and control using NewtonianVAE. Therefore, we propose 6D NewtonianVAE, which extends the latent space by incorporating the rotation vector to construct the latent space representing 6-DoF poses and perform 6-DoF control based on the estimated poses. Experimental results showed that our method achieves 6-DoF control with an accuracy within 7\u00a0mm and 0.02 rad in a real-world. It was also shown that 6-DoF control is possible even in unseen environments. Our approach enables end-to-end 6-DoF pose estimation and control without annotated data. 
It also eliminates the need for RGB-D or point cloud data and relies solely on RGB images, reducing implementation and computational costs.<\/jats:p>","DOI":"10.1007\/s10015-025-01026-0","type":"journal-article","created":{"date-parts":[[2025,5,21]],"date-time":"2025-05-21T11:16:29Z","timestamp":1747826189000},"page":"472-483","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["6D NewtonianVAE: 6-DoF object pose estimation and control method for robotic tasks via learning from multi-view visual information"],"prefix":"10.1007","volume":"30","author":[{"given":"Mai","family":"Terashima","sequence":"first","affiliation":[]},{"given":"Ryo","family":"Okumura","sequence":"additional","affiliation":[]},{"given":"Pedro Miguel","family":"Uriguen Eljuri","sequence":"additional","affiliation":[]},{"given":"Katsuyoshi","family":"Maeyama","sequence":"additional","affiliation":[]},{"given":"Yuanyuan","family":"Jia","sequence":"additional","affiliation":[]},{"given":"Tadahiro","family":"Taniguchi","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,5,21]]},"reference":[{"issue":"8","key":"1026_CR1","doi-asserted-by":"publisher","first-page":"1240","DOI":"10.3390\/agriculture12081240","volume":"12","author":"E Vrochidou","year":"2022","unstructured":"Vrochidou E, Tsakalidou VN, Kalathas I, Gkrimpizis T, Pachidis T, Kaburlasos VG (2022) An overview of end effectors in agricultural robotic harvesting systems. Agriculture 12(8):1240","journal-title":"Agriculture"},{"issue":"1","key":"1026_CR2","doi-asserted-by":"publisher","first-page":"651","DOI":"10.1146\/annurev-control-062420-090543","volume":"4","author":"A Attanasio","year":"2021","unstructured":"Attanasio A, Scaglioni B, De Momi E, Fiorini P, Valdastri P (2021) Autonomy in surgical robotics. Annu Rev Control Robot Auton Syst 4(1):651\u2013679","journal-title":"Annu Rev Control Robot Auton Syst"},{"key":"1026_CR3","doi-asserted-by":"crossref","unstructured":"Xiang Y, Schmidt T, Narayanan V, Fox D (2017) \u201cPosecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes,\u201d arXiv preprint arXiv:1711.00199","DOI":"10.15607\/RSS.2018.XIV.019"},{"key":"1026_CR4","doi-asserted-by":"crossref","unstructured":"Kendall A, Grimes M, Cipolla R (2015) \u201cPosenet: A convolutional network for real-time 6-dof camera relocalization,\u201d In: Proceedings of the IEEE international conference on computer vision, pp. 2938\u20132946","DOI":"10.1109\/ICCV.2015.336"},{"key":"1026_CR5","doi-asserted-by":"crossref","unstructured":"Li Y, Wang G, Ji X, Xiang Y, Fox D (2018) \u201cDeepim: Deep iterative matching for 6d pose estimation,\u201d In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 683\u2013698","DOI":"10.1007\/978-3-030-01231-1_42"},{"key":"1026_CR6","doi-asserted-by":"crossref","unstructured":"Wang C, Xu D, Zhu Y, Mart\u00edn-Mart\u00edn R, Lu C, Fei-Fei L, Savarese S (2019) \u201cDensefusion: 6d object pose estimation by iterative dense fusion,\u201d In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp. 
3343\u20133352","DOI":"10.1109\/CVPR.2019.00346"},{"issue":"5","key":"1026_CR7","doi-asserted-by":"publisher","first-page":"1328","DOI":"10.1109\/TRO.2021.3056043","volume":"37","author":"X Deng","year":"2021","unstructured":"Deng X, Mousavian A, Xiang Y, Xia F, Bretl T, Fox D (2021) PoseRBPF: a Rao-Blackwellized particle filter for 6-D object pose tracking. IEEE Trans Robot 37(5):1328\u20131342","journal-title":"IEEE Trans Robot"},{"key":"1026_CR8","doi-asserted-by":"crossref","unstructured":"Lin J, Wei Z, Li Z, Xu S, Jia K, Li Y (2021) DualPoseNet: category-level 6D object pose and size estimation using dual pose network with refined learning of pose consistency. In: Proceedings of the IEEE\/CVF international conference on computer vision, pp 3560\u20133569","DOI":"10.1109\/ICCV48922.2021.00354"},{"key":"1026_CR9","doi-asserted-by":"crossref","unstructured":"Amini A, Selvam\u00a0Periyasamy A, Behnke S (2022) Yolopose: transformer-based multi-object 6D pose estimation using keypoint regression. In: International conference on intelligent autonomous systems. Springer, pp 392\u2013406","DOI":"10.1007\/978-3-031-22216-0_27"},{"key":"1026_CR10","doi-asserted-by":"crossref","unstructured":"Li Z, Stamos I (2023) Depth-based 6dof object pose estimation using swin transformer. In: 2023 IEEE\/RSJ international conference on intelligent robots and systems (IROS). IEEE, pp 1185\u20131191","DOI":"10.1109\/IROS55552.2023.10342215"},{"key":"1026_CR11","unstructured":"Ha D, Schmidhuber J (2018) Recurrent world models facilitate policy evolution. Adv Neural Inf Process Syst 31"},{"issue":"13","key":"1026_CR12","doi-asserted-by":"publisher","first-page":"780","DOI":"10.1080\/01691864.2023.2225232","volume":"37","author":"T Taniguchi","year":"2023","unstructured":"Taniguchi T, Murata S, Suzuki M, Ognibene D, Lanillos P, Ugur E, Jamone L, Nakamura T, Ciria A, Lara B et al (2023) World models and predictive coding for cognitive and developmental robotics: frontiers and challenges. Adv Robot 37(13):780\u2013806","journal-title":"Adv Robot"},{"key":"1026_CR13","doi-asserted-by":"publisher","first-page":"573","DOI":"10.1016\/j.neunet.2021.09.011","volume":"144","author":"K Friston","year":"2021","unstructured":"Friston K, Moran RJ, Nagai Y, Taniguchi T, Gomi H, Tenenbaum J (2021) World model learning and inference. Neural Netw 144:573\u2013590","journal-title":"Neural Netw"},{"key":"1026_CR14","unstructured":"Hafner D, Lillicrap T, Norouzi M, Ba J (2020) Mastering Atari with discrete world models. arXiv:2010.02193"},{"key":"1026_CR15","unstructured":"Hafner D, Lillicrap T, Ba J, Norouzi M (2019) Dream to control: learning behaviors by latent imagination. arXiv:1912.01603"},{"key":"1026_CR16","doi-asserted-by":"crossref","unstructured":"Jaques M, Burke M, Hospedales TM (2021) Newtonianvae: proportional control and goal identification from pixels via physical latent spaces. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 4454\u20134463","DOI":"10.1109\/CVPR46437.2021.00443"},{"key":"1026_CR17","doi-asserted-by":"crossref","unstructured":"Okumura R, Nishio N, Taniguchi T (2022) Tactile-sensitive newtonianvae for high-accuracy industrial connector-socket insertion. arXiv:2203.05955","DOI":"10.1109\/IROS47612.2022.9981610"},{"key":"1026_CR18","unstructured":"Kato Y, Okumura R, Taniguchi T (2023) World-model-based control for industrial box-packing of multiple objects using newtonianvae. 
arXiv:2308.02136"},{"key":"1026_CR19","unstructured":"Terashima M, Shibata H, Ito M, Taniguchi T et\u00a0al (2023) Multi-modal newtonianvae: high-precision reaching method for autonomous suturing. In: Proceedings of the annual conference of JSAI. Japn Soc Artif Intell 2G1OS21c02"},{"key":"1026_CR20","doi-asserted-by":"crossref","unstructured":"Lowe DG (1999) Object recognition from local scale-invariant features. In: Proceedings of the seventh IEEE international conference on computer vision, vol\u00a02. IEEE, pp 1150\u20131157","DOI":"10.1109\/ICCV.1999.790410"},{"key":"1026_CR21","doi-asserted-by":"crossref","unstructured":"Hinterstoisser S, Holzer S, Cagniart C, Ilic S, Konolige K, Navab N, Lepetit V (2011) Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes. In: 2011 international conference on computer vision. IEEE, pp 858\u2013865","DOI":"10.1109\/ICCV.2011.6126326"},{"key":"1026_CR22","doi-asserted-by":"crossref","unstructured":"Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin Z, Guo B (2021) Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE\/CVF international conference on computer vision, pp 10012\u201310022","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"1026_CR23","doi-asserted-by":"crossref","unstructured":"Hagelskj\u00e6r F, Buch AG (2020) PointVoteNet: accurate object detection and 6 DOF pose estimation in point clouds. In: 2020 IEEE international conference on image processing (ICIP). IEEE, pp 2641\u20132645","DOI":"10.1109\/ICIP40778.2020.9191119"},{"key":"1026_CR24","unstructured":"Hafner D, Lillicrap T, Fischer I, Villegas R, Ha D, Lee H, Davidson J (2019) Learning latent dynamics for planning from pixels. In: International conference on machine learning. PMLR, pp 2555\u20132565"},{"issue":"9","key":"1026_CR25","doi-asserted-by":"publisher","first-page":"1517","DOI":"10.1080\/00207170701491070","volume":"80","author":"A Richards","year":"2007","unstructured":"Richards A, How JP (2007) Robust distributed model predictive control. Int J Control 80(9):1517\u20131531","journal-title":"Int J Control"},{"key":"1026_CR26","unstructured":"Chen C, Wu Y-F, Yoon J, Ahn S (2022) Transdreamer: reinforcement learning with transformer world models. arXiv:2202.09481"},{"key":"1026_CR27","unstructured":"Vaswani A (2017) Attention is all you need. Adv Neural Inf Process Syst"},{"issue":"19","key":"1026_CR28","doi-asserted-by":"publisher","first-page":"1212","DOI":"10.1080\/01691864.2023.2264363","volume":"37","author":"A Kinose","year":"2023","unstructured":"Kinose A, Okada M, Okumura R, Taniguchi T (2023) Multi-view dreaming: multi-view world model with contrastive learning. Adv Robot 37(19):1212\u20131220","journal-title":"Adv Robot"},{"issue":"8","key":"1026_CR29","doi-asserted-by":"publisher","first-page":"1771","DOI":"10.1162\/089976602760128018","volume":"14","author":"GE Hinton","year":"2002","unstructured":"Hinton GE (2002) Training products of experts by minimizing contrastive divergence. Neural Comput 14(8):1771\u20131800","journal-title":"Neural Comput"},{"key":"1026_CR30","unstructured":"Shi Y, Paige B, Torr P et\u00a0al (2019) Variational mixture-of-experts autoencoders for multi-modal deep generative models. Adv Neural Inf Process Syst 32"},{"key":"1026_CR31","unstructured":"Sutter TM, Daunhawer I, Vogt JE (2021) Generalized multimodal ELBO. 
In: 9th international conference on learning representations (ICLR 2021)"},{"issue":"3","key":"1026_CR32","doi-asserted-by":"publisher","first-page":"36","DOI":"10.1109\/MRA.2015.2448951","volume":"22","author":"B Calli","year":"2015","unstructured":"Calli B, Walsman A, Singh A, Srinivasa S, Abbeel P, Dollar AM (2015) Benchmarking in manipulation research: using the Yale-CMU-Berkeley object and model set. IEEE Robot Autom Mag 22(3):36\u201352","journal-title":"IEEE Robot Autom Mag"},{"key":"1026_CR33","unstructured":"Tassa Y, Doron Y, Muldal A, Erez T, Li Y, Casas DDL, Budden D, Abdolmaleki A, Merel J, Lefrancq A et\u00a0al (2018) Deepmind control suite. arXiv:1801.00690"},{"key":"1026_CR34","doi-asserted-by":"crossref","unstructured":"Zhao TZ, Kumar V, Levine S, Finn C (2023) Learning fine-grained bimanual manipulation with low-cost hardware. arXiv:2304.13705","DOI":"10.15607\/RSS.2023.XIX.016"},{"key":"1026_CR35","doi-asserted-by":"crossref","unstructured":"He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770\u2013778","DOI":"10.1109\/CVPR.2016.90"}],"container-title":["Artificial Life and Robotics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10015-025-01026-0.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10015-025-01026-0\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10015-025-01026-0.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,6]],"date-time":"2025-09-06T15:11:52Z","timestamp":1757171512000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10015-025-01026-0"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,5,21]]},"references-count":35,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2025,8]]}},"alternative-id":["1026"],"URL":"https:\/\/doi.org\/10.1007\/s10015-025-01026-0","relation":{},"ISSN":["1433-5298","1614-7456"],"issn-type":[{"type":"print","value":"1433-5298"},{"type":"electronic","value":"1614-7456"}],"subject":[],"published":{"date-parts":[[2025,5,21]]},"assertion":[{"value":"27 September 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"2 April 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"21 May 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}
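The abstract above describes proportional control performed directly in a learned latent pose space, with the latent extended by a rotation vector to cover 6-DoF poses. As a rough illustration only (not code from the paper, and with all names and values hypothetical), the sketch below shows P-control on a 6-D latent made of a 3-D position and a 3-D rotation vector; the trained encoder that would supply these latent poses from RGB images is omitted.

```python
import numpy as np

# Illustrative sketch, assuming a NewtonianVAE-style latent space in which the
# latent coordinates are (approximately) proportional to the true 6-DoF pose:
# dims 0-2 ~ translation, dims 3-5 ~ rotation vector (axis-angle).

def proportional_action(x_current, x_goal, gain=0.5):
    """P-control in latent space: command proportional to the latent error."""
    x_current = np.asarray(x_current, dtype=float)
    x_goal = np.asarray(x_goal, dtype=float)
    return gain * (x_goal - x_current)

# Hypothetical latent poses (values made up for illustration).
x_goal = np.array([0.10, -0.05, 0.30, 0.0, 0.0, 0.5])
x_now  = np.array([0.12, -0.02, 0.25, 0.0, 0.1, 0.4])

u = proportional_action(x_now, x_goal)
print(u)  # 6-DoF command driving the latent pose toward the goal
```

In practice the two latent poses would come from the trained model applied to the current and goal observations; the sketch only shows why a latent space proportional to pose makes the downstream controller this simple.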