{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,14]],"date-time":"2025-10-14T00:32:07Z","timestamp":1760401927780,"version":"build-2065373602"},"reference-count":39,"publisher":"MDPI AG","issue":"8","license":[{"start":{"date-parts":[[2020,4,17]],"date-time":"2020-04-17T00:00:00Z","timestamp":1587081600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>This paper proposes a framework that allows the observation of a scene iteratively to answer a given question about the scene. Conventional visual question answering (VQA) methods are designed to answer given questions based on single-view images. However, in real-world applications, such as human\u2013robot interaction (HRI), in which camera angles and occluded scenes must be considered, answering questions based on single-view images might be difficult. Since HRI applications make it possible to observe a scene from multiple viewpoints, it is reasonable to discuss the VQA task in multi-view settings. In addition, because it is usually challenging to observe a scene from arbitrary viewpoints, we designed a framework that allows the observation of a scene actively until the necessary scene information to answer a given question is obtained. The proposed framework achieves comparable performance to a state-of-the-art method in question answering and simultaneously decreases the number of required observation viewpoints by a significant margin. Additionally, we found our framework plausibly learned to choose better viewpoints for answering questions, lowering the required number of camera movements. Moreover, we built a multi-view VQA dataset based on real images. 
The proposed framework shows high accuracy (94.01%) for the unseen real image dataset.<\/jats:p>","DOI":"10.3390\/s20082281","type":"journal-article","created":{"date-parts":[[2020,4,21]],"date-time":"2020-04-21T04:49:38Z","timestamp":1587444578000},"page":"2281","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":10,"title":["Multi-View Visual Question Answering with Active Viewpoint Selection"],"prefix":"10.3390","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2181-9475","authenticated-orcid":false,"given":"Yue","family":"Qiu","sequence":"first","affiliation":[{"name":"Graduate School of Science and Technology, University of Tsukuba, Tsukuba 305-8577, Japan"},{"name":"National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba 305-8560, Japan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0638-0855","authenticated-orcid":false,"given":"Yutaka","family":"Satoh","sequence":"additional","affiliation":[{"name":"Graduate School of Science and Technology, University of Tsukuba, Tsukuba 305-8577, Japan"},{"name":"National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba 305-8560, Japan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2156-844X","authenticated-orcid":false,"given":"Ryota","family":"Suzuki","sequence":"additional","affiliation":[{"name":"National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba 305-8560, Japan"}]},{"given":"Kenji","family":"Iwata","sequence":"additional","affiliation":[{"name":"National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba 305-8560, Japan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8844-165X","authenticated-orcid":false,"given":"Hirokatsu","family":"Kataoka","sequence":"additional","affiliation":[{"name":"National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba 305-8560, Japan"}]}],"member":"1968","published-online":{"date-parts":[[2020,4,17]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., and Parikh, D. (2015, January 11\u201318). Vqa: Visual question answering. Proceedings of the IEEE International Conference on Computer Vision, Las Condes, Chile.","DOI":"10.1109\/ICCV.2015.279"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. (2017, January 21\u201326). Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.670"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., and Zhang, L. (2018, January 18\u201324). Bottom-up and top-down attention for image captioning and visual question answering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00636"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., and Rohrbach, M. (2016). Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv.","DOI":"10.18653\/v1\/D16-1044"},{"key":"ref_5","unstructured":"Hudson, D.A., and Manning, C.D. (2018). Compositional attention networks for machine reasoning. 
arXiv."},{"key":"ref_6","unstructured":"Kim, J.H., Jun, J., and Zhang, B.T. (2018). Bilinear attention networks. arXiv."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Das, A., Kottur, S., Gupta, K., Singh, A., Yadav, D., Moura, J.M.F., Parikh, D., and Batra, D. (2017, January 21\u201326). Visual dialog. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.121"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Das, A., Kottur, S., Moura, J.M.F., Lee, S., and Batra, D. (2017, January 22\u201329). Learning cooperative visual dialog agents with deep reinforcement learning. Proceedings of the IEEE international conference on computer vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.321"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Qiu, Y., Satoh, Y., Suzuki, R., and Kataoka, H. (2019, January 16\u201319). Incorporating 3D Information into Visual Question Answering. Proceedings of the IEEE International Conference on 3D Vision (3DV), Quebec City, QB, Canada.","DOI":"10.1109\/3DV.2019.00088"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Perez, E., Strub, F., De Vries, H., Dumoulin, V., and Courville, A. (2018, January 2\u20137). Film: Visual reasoning with a general conditioning layer. Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA.","DOI":"10.1609\/aaai.v32i1.11671"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D.C., Zitnick, L., and Dollar, P. (2014, January 6\u201312). Microsoft Coco: Common Objects in Context. Proceedings of the European Conference on Computer Vision, Zurich, Switzerland.","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Johnson, J., Hariharan, B., van der Maaten, L., Li, F.-F.C., Zitnick, L., and Girshick, R. (2017, January 21\u201326). Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.215"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Su, H., Jampani, V., Sun, D., Maji, S., Kalogerakis, E., Yang, M.-H., and Kautz, J. (2018, January 18\u201323). Splatnet: Sparse lattice networks for point cloud processing. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00268"},{"key":"ref_14","unstructured":"Qi, C.R., Su, H., Mo, K., and Guibas, L.J. (2017, January 21\u201326). Pointnet: Deep learning on point sets for 3d classification and segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Gkioxari, G., Malik, J., and Johnson, J. (2019, January 27\u201328). Mesh r-cnn. Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00988"},{"key":"ref_16","unstructured":"Liu, Z., Tang, H., Lin, Y., and Han, S. (2019, January 8\u201314). Point-Voxel CNN for efficient 3D deep learning. Advances in Neural Information Processing Systems. 
Proceedings of the Conference on Neural Information Processing Systems, Vancouver, BC, Canada."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"1204","DOI":"10.1126\/science.aar6170","article-title":"Neural scene representation and rendering","volume":"360","author":"Eslami","year":"2018","journal-title":"Science"},{"key":"ref_18","unstructured":"Kingma, D.P., Mohamed, S., Rezende, D.J., Mohamed, S., and Welling, M. (2014). Semi-supervised learning with deep generative models. arXiv."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Park, J.J., Florence, P., Straub, J., Newcombe, R., and Lovegrove, S. (2019, January 16\u201320). Deepsdf: Learning continuous signed distance functions for shape representation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00025"},{"key":"ref_20","unstructured":"Sitzmann, V., Zollh\u00f6fer, M., and Wetzstein, G. (2019). Scene representation networks: Continuous 3D-structure-aware neural scene representations. arXiv."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"279","DOI":"10.1007\/BF00992698","article-title":"Q-learning","volume":"8","author":"Watkins","year":"1992","journal-title":"Mach. Learn."},{"key":"ref_22","unstructured":"Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv."},{"key":"ref_23","first-page":"1334","article-title":"End-to-end training of deep visuomotor policies","volume":"17","author":"Levine","year":"2016","journal-title":"J. Mach. Learn. Res."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Das, A., Datta, S., Gkioxari, G., Lee, S., Parikh, D., and Batra, D. (2018, January 18\u201323). Embodied question answering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00008"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Yu, L., Chen, X., Gkioxari, G., Bansal, M., Berg, T.L., and Batra, D. (2019, January 16\u201320). Multi-target embodied question answering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00647"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Wijmans, E., Datta, S., Maksymets, O., Das, A., Gkioxari, G., Lee, S., Essa, I., Parikh, D., and Batra, D. (2019, January 16\u201320). Embodied question answering in photorealistic environments with point cloud perception. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00682"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"2451","DOI":"10.1162\/089976600300015015","article-title":"Learning to forget: Continual prediction with LSTM","volume":"12","author":"Gers","year":"2000","journal-title":"Neural Comput."},{"key":"ref_28","unstructured":"Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., and Antigaet, L. (2019). PyTorch: An imperative style, high-performance deep learning library. arXiv."},{"key":"ref_29","unstructured":"Kingma, D.P., and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Cho, K., Van Merri\u00ebnboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). 
Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv.","DOI":"10.3115\/v1\/D14-1179"},{"key":"ref_31","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhinl, I. (2017). Attention is all you need. arXiv."},{"key":"ref_32","unstructured":"(2020, April 08). FiLM Implementation Code. Available online: https:\/\/github.com\/ethanjperez\/film."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Su, H., Maji, S., Kalogerakis, E., and Learned-Miller, E. (2015, January 7\u201313). Multi-view convolutional neural networks for 3D shape recognition. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.114"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Kanezaki, A., Matsushita, Y., and Nishida, Y. (2018, January 18\u201323). Rotationnet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00526"},{"key":"ref_35","unstructured":"(2020, April 08). Software Blender Site. Available online: https:\/\/www.blender.org\/."},{"key":"ref_36","unstructured":"(2020, April 08). CLEVR Dataset Generation Implementation Code. Available online: https:\/\/github.com\/facebookresearch\/clevr-dataset-gen."},{"key":"ref_37","unstructured":"(2020, April 08). Sony Alpha 7 III Mirrorless Single Lens Digital Camera Site. Available online: https:\/\/www.sony.jp\/ichigan\/products\/ILCE-7RM3\/."},{"key":"ref_38","unstructured":"(2020, April 08). 3D-MFP Site. Available online: https:\/\/www.ortery.jp\/photography-equipment\/3d-photography-ja\/3d-mfp\/."},{"key":"ref_39","unstructured":"(2020, April 08). 3DSOM Pro V5 Site. Available online: http:\/\/www.3dsom.com\/new-release-v5-markerless-workflow\/."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/20\/8\/2281\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,13]],"date-time":"2025-10-13T13:21:19Z","timestamp":1760361679000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/20\/8\/2281"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,4,17]]},"references-count":39,"journal-issue":{"issue":"8","published-online":{"date-parts":[[2020,4]]}},"alternative-id":["s20082281"],"URL":"https:\/\/doi.org\/10.3390\/s20082281","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2020,4,17]]}}}
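The record above is a Crossref REST API work envelope: a {"status":"ok","message-type":"work",...} wrapper whose "message" object carries the bibliographic data. As a minimal sketch of how such a record can be retrieved and read, assuming network access to the public api.crossref.org /works/{doi} endpoint (the DOI is taken from the record itself; the helper name fetch_work is ours, not part of any API):

```python
"""Minimal sketch: fetch and read a Crossref work record like the one above.

Assumes the public Crossref REST API; /works/{doi} returns an envelope of
the form {"status": "ok", "message-type": "work", "message": {...}}.
"""
import json
import urllib.request

DOI = "10.3390/s20082281"  # DOI taken from the record above


def fetch_work(doi: str) -> dict:
    """Return the "message" object of a Crossref work record (hypothetical helper)."""
    url = f"https://api.crossref.org/works/{doi}"
    with urllib.request.urlopen(url) as resp:
        envelope = json.load(resp)
    # Sanity-check the envelope shape seen in the record above.
    assert envelope["status"] == "ok" and envelope["message-type"] == "work"
    return envelope["message"]


if __name__ == "__main__":
    work = fetch_work(DOI)
    # "title" and "container-title" are arrays; take the first entry of each.
    print(work["title"][0])
    print(work["container-title"][0], "vol.", work["volume"], "p.", work.get("page"))
    # Dates are nested date-parts arrays, e.g. "issued": {"date-parts": [[2020, 4, 17]]}.
    print("issued:", "-".join(str(p) for p in work["issued"]["date-parts"][0]))
    print("references:", work.get("reference-count"))
    print("cited by:", work.get("is-referenced-by-count"))
```

Note the two structural conventions the sketch relies on, both visible in the record: titles are arrays even when single-valued, and every date is a "date-parts" array of [year, month, day] triples rather than an ISO string.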