{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,5]],"date-time":"2025-12-05T12:18:22Z","timestamp":1764937102654,"version":"build-2065373602"},"reference-count":54,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2019,8,6]],"date-time":"2019-08-06T00:00:00Z","timestamp":1565049600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["MAKE"],"abstract":"<jats:p>Recognizing objects and estimating their poses have a wide range of application in robotics. For instance, to grasp objects, robots need the position and orientation of objects in 3D. The task becomes challenging in a cluttered environment with different types of objects. A popular approach to tackle this problem is to utilize a deep neural network for object recognition. However, deep learning-based object detection in cluttered environments requires a substantial amount of data. Collection of these data requires time and extensive human labor for manual labeling. In this study, our objective was the development and validation of a deep object recognition framework using a synthetic depth image dataset. We synthetically generated a depth image dataset of 22 objects randomly placed in a 0.5 m \u00d7 0.5 m \u00d7 0.1 m box, and automatically labeled all objects with an occlusion rate below 70%. Faster Region Convolutional Neural Network (R-CNN) architecture was adopted for training using a dataset of 800,000 synthetic depth images, and its performance was tested on a real-world depth image dataset consisting of 2000 samples. Deep object recognizer has 40.96% detection accuracy on the real depth images and 93.5% on the synthetic depth images. Training the deep learning model with noise-added synthetic images improves the recognition accuracy for real images to 46.3%. 
The object detection framework can be trained on synthetically generated depth data, and then employed for object recognition on the real depth data in a cluttered environment. Synthetic depth data-based deep object detection has the potential to substantially decrease the time and human effort required for the extensive data collection and labeling.<\/jats:p>","DOI":"10.3390\/make1030051","type":"journal-article","created":{"date-parts":[[2019,8,7]],"date-time":"2019-08-07T03:09:08Z","timestamp":1565147348000},"page":"883-903","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":10,"title":["Deep Learning Based Object Recognition Using Physically-Realistic Synthetic Depth Scenes"],"prefix":"10.3390","volume":"1","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1432-8205","authenticated-orcid":false,"given":"Daulet","family":"Baimukashev","sequence":"first","affiliation":[{"name":"Department of Robotics, Nazarbayev University, 53 Kabanbay batyr Ave., Astana Z05H0P9, Kazakhstan"}]},{"given":"Alikhan","family":"Zhilisbayev","sequence":"additional","affiliation":[{"name":"Department of Robotics, Nazarbayev University, 53 Kabanbay batyr Ave., Astana Z05H0P9, Kazakhstan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6169-8252","authenticated-orcid":false,"given":"Askat","family":"Kuzdeuov","sequence":"additional","affiliation":[{"name":"Department of Robotics, Nazarbayev University, 53 Kabanbay batyr Ave., Astana Z05H0P9, Kazakhstan"}]},{"given":"Artemiy","family":"Oleinikov","sequence":"additional","affiliation":[{"name":"Department of Robotics, Nazarbayev University, 53 Kabanbay batyr Ave., Astana Z05H0P9, Kazakhstan"}]},{"given":"Denis","family":"Fadeyev","sequence":"additional","affiliation":[{"name":"Department of Robotics, Nazarbayev University, 53 Kabanbay batyr Ave., Astana Z05H0P9, Kazakhstan"}]},{"given":"Zhanat","family":"Makhataeva","sequence":"additional","affiliation":[{"name":"Department of 
Robotics, Nazarbayev University, 53 Kabanbay batyr Ave., Astana Z05H0P9, Kazakhstan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4042-425X","authenticated-orcid":false,"given":"Huseyin Atakan","family":"Varol","sequence":"additional","affiliation":[{"name":"Department of Robotics, Nazarbayev University, 53 Kabanbay batyr Ave., Astana Z05H0P9, Kazakhstan"}]}],"member":"1968","published-online":{"date-parts":[[2019,8,6]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"172","DOI":"10.1109\/TASE.2016.2600527","article-title":"Analysis and Observations From the First Amazon Picking Challenge","volume":"15","author":"Correll","year":"2018","journal-title":"IEEE Trans. Autom. Sci. Eng."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Li, W., Luo, Y., Wang, P., Qin, Z., Zhou, H., and Qiao, H. (2016, January 3\u20137). Recent advances on application of deep learning for recovering object pose. Proceedings of the IEEE International Conference on Robotics and Biomimetics (ROBIO), Qingdao, China.","DOI":"10.1109\/ROBIO.2016.7866501"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"436","DOI":"10.1038\/nature14539","article-title":"Deep learning","volume":"521","author":"LeCun","year":"2015","journal-title":"Nature"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Gupta, S., Arbel\u00e1ez, P., Girshick, R., and Malik, J. (2015, January 7\u201312). Aligning 3D models to RGB-D images of cluttered scenes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7299105"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Peng, X., Sun, B., Ali, K., and Saenko, K. (2015, January 7\u201313). Learning deep object detectors from 3D models. 
Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.151"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014, January 23\u201328). Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.81"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016, January 27\u201330). You only look once: Unified, real-time object detection. Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.91"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., and Berg, A.C. (2016, January 8\u201316). SSD: Single shot multibox detector. Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46448-0_2"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Ren, S., He, K., Girshick, R., and Sun, J. (2017). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell., 1137\u20131149.","DOI":"10.1109\/TPAMI.2016.2577031"},{"key":"ref_10","unstructured":"Krizhevsky, A., Sutskever, I., and Hinton, G.E. (2012, January 3\u20136). ImageNet classification with deep convolutional neural networks. Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA."},{"key":"ref_11","unstructured":"Carlucci, F.M., Russo, P., and Caputo, B. (June, January 29). A deep representation for depth images from synthetic data. 
Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Singapore."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Song, S., Yumer, E., Savva, M., Lee, J.Y., Jin, H., and Funkhouser, T. (2017, January 21\u201326). Physically-based rendering for indoor scene understanding using convolutional neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.537"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Mitash, C., Bekris, K.E., and Boularias, A. (2017, January 24\u201328). A self-supervised learning system for object detection using physics simulation and multi-view pose estimation. Proceedings of the IEEE\/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada.","DOI":"10.1109\/IROS.2017.8202206"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Ben-David, S., Blitzer, J., Crammer, K., and Pereira, F. (2007, January 3\u20136). Analysis of representations for domain adaptation. Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada.","DOI":"10.7551\/mitpress\/7503.003.0022"},{"key":"ref_15","first-page":"1","article-title":"Domain-adversarial training of neural networks","volume":"17","author":"Ganin","year":"2016","journal-title":"J. Mach. Learn. Res."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Hinterstoisser, S., Lepetit, V., Wohlhart, P., and Konolige, K. (2018, January 8\u201314). On Pre-trained Image Features and Synthetic Images for Deep Learning. Proceedings of the ECCV Workshops, Munich, Germany.","DOI":"10.1007\/978-3-030-11009-3_42"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Mithun, N.C., Munir, S., Guo, K., and Shelton, C. (2018, January 11\u201313). ODDS: Real-Time Object Detection Using Depth Sensors on Embedded GPUs. 
Proceedings of the 2018 17th ACM\/IEEE International Conference on Information Processing in Sensor Networks (IPSN), Porto, Portugal.","DOI":"10.1109\/IPSN.2018.00051"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Pinto, N., Barhomi, Y., Cox, D.D., and DiCarlo, J.J. (2011, January 5\u20137). Comparing state-of-the-art visual features on invariant object recognition tasks. Proceedings of the IEEE Workshop on Applications of Computer Vision (WACV), Kona, HI, USA.","DOI":"10.1109\/WACV.2011.5711540"},{"key":"ref_19","unstructured":"Rajpura, P.S., Hegde, R.S., and Bojinov, H. (arXiv, 2017). Object Detection Using Deep CNNs Trained on Synthetic Images, arXiv."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Georgakis, G., Mousavian, A., Berg, A.C., and Kosecka, J. (arXiv, 2017). Synthesizing Training Data for Object Detection in Indoor Scenes, arXiv.","DOI":"10.15607\/RSS.2017.XIII.043"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Dwibedi, D., Misra, I., and Hebert, M. (2017, January 22\u201329). Cut, Paste and Learn: Surprisingly Easy Synthesis for Instance Detection. Proceedings of the IEEE International Conference on Computer Vision 2017, Venice, Italy.","DOI":"10.1109\/ICCV.2017.146"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Lai, K., Bo, L., Ren, X., and Fox, D. (2011, January 9\u201313). A large-scale hierarchical multi-view RGB-D object dataset. Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China.","DOI":"10.1109\/ICRA.2011.5980382"},{"key":"ref_23","unstructured":"Socher, R., Huval, B., Bhat, B., Manning, C.D., and Ng, A.Y. (2012, January 3\u20136). Convolutional-recursive deep learning for 3D object classification. 
Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"437","DOI":"10.1177\/0278364917713117","article-title":"RGB-D object detection and semantic segmentation for autonomous manipulation in clutter","volume":"37","author":"Schwarz","year":"2018","journal-title":"Int. J. Robot. Res."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Newcombe, R.A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A.J., Kohli, P., Shotton, J., Hodges, S., and Fitzgibbon, A. (2011, January 26\u201329). KinectFusion: Real-time dense surface mapping and tracking. Proceedings of the IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Basel, Switzerland.","DOI":"10.1109\/ISMAR.2011.6092378"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Izadi, S., Kim, D., Hilliges, O., Molyneaux, D., Newcombe, R., Kohli, P., Shotton, J., Hodges, S., Freeman, D., and Davison, A. (2011, January 16\u201319). KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera. Proceedings of the ACM Symposium on User Interface Software and Technology, Santa Barbara, CA, USA.","DOI":"10.1145\/2047196.2047270"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., and Blake, A. (2011, January 20\u201325). Real-time human pose recognition in parts from single depth images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Washington, DC, USA.","DOI":"10.1109\/CVPR.2011.5995316"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Biswas, J., and Veloso, M. (2012, January 14\u201318). Depth camera based indoor mobile robot localization and navigation. 
Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), St Paul, MN, USA.","DOI":"10.1109\/ICRA.2012.6224766"},{"key":"ref_29","unstructured":"Maier, D., Hornung, A., and Bennewitz, M. (December, January 29). Real-time navigation in 3D environments based on depth camera data. Proceedings of the IEEE-RAS International Conference on Humanoid Robots, Osaka, Japan."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"425732","DOI":"10.1155\/2015\/425732","article-title":"Locomotion strategy selection for a hybrid mobile robot using time of flight depth sensor","volume":"2015","author":"Saudabayev","year":"2015","journal-title":"J. Sens."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"1759","DOI":"10.1109\/TBME.2017.2776157","article-title":"User-Independent Intent Recognition for Lower Limb Prostheses Using Depth Sensing","volume":"65","author":"Massalin","year":"2018","journal-title":"IEEE Trans. Biomed. Eng."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"180101","DOI":"10.1038\/sdata.2018.101","article-title":"Human grasping database for activities of daily living with depth, color and kinematic data streams","volume":"5","author":"Saudabayev","year":"2018","journal-title":"Sci. Data"},{"key":"ref_33","unstructured":"Koenig, N.P., and Howard, A. (October, January 28). Design and use paradigms for Gazebo, an open-source multi-robot simulator. Proceedings of the IEEE\/RSJ International Conference on Intelligent Robots and Systems (IROS), Sendai, Japan."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Gschwandtner, M., Kwitt, R., Uhl, A., and Pree, W. (2011, January 26\u201328). BlenSor: Blender Sensor Simulation Toolbox. Proceedings of the International Symposium on Visual Computing, Las Vegas, NV, USA.","DOI":"10.1007\/978-3-642-24031-7_20"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Liebelt, J., and Schmid, C. (2010, January 13\u201318). 
Multi-view object class detection with a 3D geometric model. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, USA.","DOI":"10.1109\/CVPR.2010.5539836"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Gupta, A., Vedaldi, A., and Zisserman, A. (2016, January 27\u201330). Synthetic data for text localisation in natural images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.254"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Wanner, S., and Goldluecke, B. (2012, January 18\u201320). Globally consistent depth labeling of 4D light fields. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA.","DOI":"10.1109\/CVPR.2012.6247656"},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"191","DOI":"10.1109\/TBC.2005.846190","article-title":"Stereoscopic image generation based on depth images for 3D TV","volume":"51","author":"Zhang","year":"2005","journal-title":"IEEE Trans. Broadcast."},{"key":"ref_39","unstructured":"Cheng, C.M., Lin, S.J., Lai, S.H., and Yang, J.C. (2008, January 8\u201311). Improved novel view synthesis from depth image with large baseline. Proceedings of the International Conference on Pattern Recognition (ICPR), Tampa, FL, USA."},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"122","DOI":"10.1016\/j.image.2008.10.008","article-title":"Depth-image-based rendering for 3DTV service over T-DMB","volume":"24","author":"Park","year":"2009","journal-title":"Signal Process. Image Commun."},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"33","DOI":"10.1002\/jemt.20092","article-title":"Complex wavelets for extended depth-of-field: A new method for the fusion of multichannel microscopy images","volume":"65","author":"Forster","year":"2004","journal-title":"Microsc. Res. 
Tech."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"519","DOI":"10.1145\/1141911.1141918","article-title":"Fast median and bilateral filtering","volume":"25","author":"Weiss","year":"2006","journal-title":"ACM Trans. Graph. (TOG)"},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"75","DOI":"10.1016\/j.infrared.2004.03.011","article-title":"Infrared image processing and data analysis","volume":"46","author":"Gonzalez","year":"2004","journal-title":"Infrared Phys. Technol."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Huang, J., Rathod, V., Sun, C., Zhu, M., Balan, A.K., Fathi, A., Fischer, I., Wojna, Z., Song, Y., and Guadarrama, S. (2017, January 20\u201325). Speed\/Accuracy Trade-Offs for Modern Convolutional Object Detectors. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.351"},{"key":"ref_45","unstructured":"Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. (arXiv, 2017). MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, arXiv."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Ioffe, S., and Vanhoucke, V. (2017, January 4\u20139). Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. 
Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA.","DOI":"10.1609\/aaai.v31i1.11231"},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"303","DOI":"10.1007\/s11263-009-0275-4","article-title":"The Pascal Visual Object Classes (VOC) Challenge","volume":"88","author":"Everingham","year":"2010","journal-title":"Int. J. Comput. Vis."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Falie, D., and Buzuloiu, V. (2007, January 13\u201314). Noise Characteristics of 3D Time-of-Flight Cameras. Proceedings of the International Symposium on Signals, Circuits and Systems, Iasi, Romania.","DOI":"10.1109\/ISSCS.2007.4292693"},{"key":"ref_50","unstructured":"Krizhevsky, A., Sutskever, I., and Hinton, G.E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Neural Inf. Process. Syst., 25."},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015, January 7\u201312). Going deeper with convolutions. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"ref_52","unstructured":"Ren, S., He, K., Girshick, R., and Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, Proceedings of the 29th Annual Conference on Neural Information Processing Systems 2015, Montreal, QC, Canada, 7\u201312 December 2015, Neural Information Processing Systems Foundation, Inc. (NIPS)."},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"McCormac, J., Handa, A., Leutenegger, S., and Davison, A.J. (2017, January 22\u201329). SceneNet RGB-D: Can 5M Synthetic Images Beat Generic ImageNet Pre-training on Indoor Segmentation?. 
Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy.","DOI":"10.1109\/ICCV.2017.292"},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Eitel, A., Springenberg, J.T., Spinello, L., Riedmiller, M., and Burgard, W. (October, January 28). Multimodal deep learning for robust RGB-D object recognition. Proceedings of the 2015 IEEE\/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany.","DOI":"10.1109\/IROS.2015.7353446"}],"container-title":["Machine Learning and Knowledge Extraction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-4990\/1\/3\/51\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T13:09:04Z","timestamp":1760188144000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-4990\/1\/3\/51"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,8,6]]},"references-count":54,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2019,9]]}},"alternative-id":["make1030051"],"URL":"https:\/\/doi.org\/10.3390\/make1030051","relation":{},"ISSN":["2504-4990"],"issn-type":[{"type":"electronic","value":"2504-4990"}],"subject":[],"published":{"date-parts":[[2019,8,6]]}}}