{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,12]],"date-time":"2026-06-12T03:29:06Z","timestamp":1781234946160,"version":"3.54.1"},"reference-count":69,"publisher":"Association for Computing Machinery (ACM)","issue":"6","license":[{"start":{"date-parts":[[2023,7,12]],"date-time":"2023-07-12T00:00:00Z","timestamp":1689120000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2023,11,30]]},"abstract":"<jats:p>Visual question answering has recently been settled as a fundamental multi-modal reasoning task of artificial intelligence that allows users to get information about visual content by asking questions in natural language. In the cultural heritage domain, this task can contribute to assisting visitors in museums and cultural sites, thus increasing engagement. However, the development of visual question answering models for cultural heritage is prevented by the lack of suitable large-scale datasets. To meet this demand, we built a large-scale heterogeneous and multilingual (Italian and English) dataset for cultural heritage that comprises approximately 500K Italian cultural assets and 6.5M question-answer pairs. We propose a novel formulation of the task that requires reasoning over both the visual content and an associated natural language description, and present baselines for this task. Results show that the current state of the art is reasonably effective but still far from satisfactory; therefore, further research in this area is recommended. 
Nonetheless, we also present a holistic baseline to address visual and contextual questions and foster future research on the topic.<\/jats:p>","DOI":"10.1145\/3590773","type":"journal-article","created":{"date-parts":[[2023,4,4]],"date-time":"2023-04-04T12:22:16Z","timestamp":1680610936000},"page":"1-20","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":17,"title":["VISCOUNTH: A Large-scale Multilingual Visual Question Answering Dataset for Cultural Heritage"],"prefix":"10.1145","volume":"19","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2537-2700","authenticated-orcid":false,"given":"Federico","family":"Becattini","sequence":"first","affiliation":[{"name":"University of Florence, Italy"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8217-6266","authenticated-orcid":false,"given":"Pietro","family":"Bongini","sequence":"additional","affiliation":[{"name":"University of Florence, Italy"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1165-853X","authenticated-orcid":false,"given":"Luana","family":"Bulla","sequence":"additional","affiliation":[{"name":"Institute of Science and Technology of Cognition, National Research Council, Italy"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1052-8322","authenticated-orcid":false,"given":"Alberto Del","family":"Bimbo","sequence":"additional","affiliation":[{"name":"University of Florence, Italy"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1605-8819","authenticated-orcid":false,"given":"Ludovica","family":"Marinucci","sequence":"additional","affiliation":[{"name":"Institute of Science and Technology of Cognition, National Research Council, 
Italy"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0528-5490","authenticated-orcid":false,"given":"Misael","family":"Mongiov\u00ec","sequence":"additional","affiliation":[{"name":"Institute of Science and Technology of Cognition, National Research Council, Italy"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9380-5160","authenticated-orcid":false,"given":"Valentina","family":"Presutti","sequence":"additional","affiliation":[{"name":"University of Bologna, Italy"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2023,7,12]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.279"},{"key":"e_1_3_2_3_2","first-page":"193","volume-title":"Proceedings of the 7th International Conference on Machine Learning, Optimization, and Data Science (LOD\u201921)","author":"Asprino Luigi","year":"2021","unstructured":"Luigi Asprino, Luana Bulla, Ludovica Marinucci, Misael Mongiov\u00ec, and Valentina Presutti. 2021. A large visual question answering dataset for cultural heritage. In Proceedings of the 7th International Conference on Machine Learning, Optimization, and Data Science (LOD\u201921). 193\u2013197."},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-76298-0_52"},{"key":"e_1_3_2_5_2","first-page":"5422","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Bai Zechen","year":"2021","unstructured":"Zechen Bai, Yuta Nakashima, and Noa Garcia. 2021. Explain me the painting: Multi-topic knowledgeable art description generation. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 
5422\u20135432."},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patrec.2021.09.008"},{"key":"e_1_3_2_7_2","first-page":"781","volume-title":"Proceedings of the Euro-Mediterranean Conference","author":"Becattini Federico","year":"2016","unstructured":"Federico Becattini, Andrea Ferracani, Lea Landucci, Daniele Pezzatini, Tiberio Uricchio, and Alberto Del Bimbo. 2016. Imaging novecento: A mobile app for automatic recognition of artworks and transfer of artistic styles. In Proceedings of the Euro-Mediterranean Conference. Springer, 781\u2013791."},{"key":"e_1_3_2_8_2","volume-title":"International Conference on Pattern Recognition","author":"Vannoni Francesco","year":"2020","unstructured":"Francesco Vannoni, Pietro Bongini, Federico Becattini, Andrew David Bagdanov, and Alberto Del Bimbo. 2020. Data collection for contextual and visual question answering in the cultural heritage domain. In International Conference on Pattern Recognition."},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00439"},{"key":"e_1_3_2_10_2","doi-asserted-by":"crossref","unstructured":"Pietro Bongini Federico Becattini Andrew D. Bagdanov and Alberto Del Bimbo. 2020. Visual question answering for cultural heritage. Retrieved from https:\/\/arxiv.org\/abs\/2003.09853.","DOI":"10.1088\/1757-899X\/949\/1\/012074"},{"key":"e_1_3_2_11_2","first-page":"268","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV\u201922)","author":"Bongini Pietro","year":"2023","unstructured":"Pietro Bongini, Federico Becattini, and Alberto Del Bimbo. 2023. Is GPT-3 all you need for visual question answering in cultural heritage? In Proceedings of the European Conference on Computer Vision (ECCV\u201922). 
Springer, 268\u2013281."},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-36107-5_4"},{"key":"e_1_3_2_13_2","doi-asserted-by":"crossref","first-page":"529","DOI":"10.1007\/978-3-031-15743-1_48","volume-title":"Proceedings of the Conference on New Trends in Database and Information Systems (ADBIS\u201922)","author":"Bulla Luana","year":"2022","unstructured":"Luana Bulla, Maria Chiara Frangipane, Maria Letizia Mancinelli, Ludovica Marinucci, Misael Mongiov\u00ec, Margherita Porena, Valentina Presutti, and Chiara Veninata. 2022. Developing and aligning a detailed controlled vocabulary for artwork. In Proceedings of the Conference on New Trends in Database and Information Systems (ADBIS\u201922). Springer, 529\u2013541."},{"key":"e_1_3_2_14_2","first-page":"36","volume-title":"Proceedings of the International Semantic Web Conference (ISWC\u201919)","author":"Carriero Valentina Anita","year":"2019","unstructured":"Valentina Anita Carriero, Aldo Gangemi, Maria Letizia Mancinelli, Ludovica Marinucci, Andrea Giovanni Nuzzolese, Valentina Presutti, and Chiara Veninata. 2019. ArCo: The Italian cultural heritage knowledge graph. In Proceedings of the International Semantic Web Conference (ISWC\u201919). 36\u201352."},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1007\/s00521-021-05893-z"},{"key":"e_1_3_2_16_2","first-page":"502","volume-title":"Proceedings of the International Conference on Pattern Recognition","author":"Cetinic Eva","year":"2021","unstructured":"Eva Cetinic. 2021. Iconographic image captioning for artworks. In Proceedings of the International Conference on Pattern Recognition. 
Springer, 502\u2013516."},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2018.07.026"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/3475799"},{"key":"e_1_3_2_19_2","first-page":"104","volume-title":"Proceedings of the 16th European Conference on Computer Vision (ECCV\u201920)","author":"Chen Yen-Chun","year":"2020","unstructured":"Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. Uniter: Universal image-text representation learning. In Proceedings of the 16th European Conference on Computer Vision (ECCV\u201920). Springer, 104\u2013120."},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/MMUL.2014.19"},{"key":"e_1_3_2_21_2","first-page":"467","volume-title":"Proceedings of the International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP\u201919)","author":"Chiaro Riccardo Del","year":"2019","unstructured":"Riccardo Del Chiaro et al. 2019. NoisyArt: A dataset for webly-supervised artwork recognition. In Proceedings of the International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP\u201919). 467\u2013475."},{"key":"e_1_3_2_22_2","doi-asserted-by":"crossref","first-page":"420","DOI":"10.1016\/j.patrec.2019.09.027","article-title":"Webly-supervised zero-shot learning for artwork instance recognition","volume":"128","author":"Chiaro Riccardo Del","year":"2019","unstructured":"Riccardo Del Chiaro et al. 2019. Webly-supervised zero-shot learning for artwork instance recognition. Pattern Recogn. Lett. 128 (2019), 420\u2013426.","journal-title":"Pattern Recogn. 
Lett."},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_2_24_2","first-page":"4171","volume-title":"Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT\u201919)","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT\u201919). 4171\u20134186."},{"issue":"17","key":"e_1_3_2_25_2","doi-asserted-by":"crossref","first-page":"6958","DOI":"10.3390\/su12176958","article-title":"A virtual assistant for natural interactions in museums","volume":"12","author":"Dugulean\u0103 Mihai","year":"2020","unstructured":"Mihai Dugulean\u0103, Victor-Alexandru Briciu, Ionu\u0163-Alexandru Duduman, and Octavian Mihai Machidon. 2020. A virtual assistant for natural interactions in museums. Sustainability 12, 17 (2020), 6958.","journal-title":"Sustainability"},{"key":"e_1_3_2_26_2","first-page":"6616","article-title":"Large-scale adversarial training for vision-and-language representation learning","volume":"33","author":"Gan Zhe","year":"2020","unstructured":"Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, and Jingjing Liu. 2020. Large-scale adversarial training for vision-and-language representation learning. Adv. Neural Info. Process. Syst. 33 (2020), 6616\u20136628.","journal-title":"Adv. Neural Info. Process. Syst."},{"key":"e_1_3_2_27_2","first-page":"2296","article-title":"Are you talking to a machine? Dataset and methods for multilingual image question","volume":"28","author":"Gao Haoyuan","year":"2015","unstructured":"Haoyuan Gao et al. 2015. Are you talking to a machine? Dataset and methods for multilingual image question. Adv. Neural Info. Process. Syst. 
28 (2015), 2296\u20132304.","journal-title":"Adv. Neural Info. Process. Syst."},{"key":"e_1_3_2_28_2","first-page":"92","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Garcia Noa","year":"2020","unstructured":"Noa Garcia et al. 2020. A dataset and baselines for visual question answering on art. In Proceedings of the European Conference on Computer Vision. Springer, 92\u2013108."},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.culher.2019.07.019"},{"key":"e_1_3_2_30_2","first-page":"11125","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"34","author":"Jiang Xiaoze","year":"2020","unstructured":"Xiaoze Jiang, Jing Yu, Zengchang Qin, Yingying Zhuang, Xingxing Zhang, Yue Hu, and Qi Wu. 2020. Dualvd: An adaptive dual encoding model for deep visual understanding in visual dialogue. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 11125\u201311132."},{"issue":"5","key":"e_1_3_2_31_2","doi-asserted-by":"crossref","first-page":"528","DOI":"10.3390\/app7050528","article-title":"Artwork identification for 360-degree panoramic images using polyhedron-based rectilinear projection and keypoint shapes","volume":"7","author":"Jin Xun","year":"2017","unstructured":"Xun Jin and Jongweon Kim. 2017. Artwork identification for 360-degree panoramic images using polyhedron-based rectilinear projection and keypoint shapes. Appl. Sci. 7, 5 (2017), 528.","journal-title":"Appl. Sci."},{"key":"e_1_3_2_32_2","volume-title":"Proceedings of the 6th International Conference on Learning Representations (ICLR\u201918)","author":"Kahou Samira Ebrahimi","year":"2018","unstructured":"Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, \u00c1kos K\u00e1d\u00e1r, Adam Trischler, and Yoshua Bengio. 2018. Figure QA: An annotated figure dataset for visual reasoning. In Proceedings of the 6th International Conference on Learning Representations (ICLR\u201918). OpenReview.net. 
Retrieved from https:\/\/openreview.net\/forum?id=H1mz0OyDz."},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.pmcj.2018.05.002"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-016-0981-7"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_3_2_36_2","unstructured":"Yinhan Liu Myle Ott Naman Goyal Jingfei Du Mandar Joshi Danqi Chen Omer Levy Mike Lewis Luke Zettlemoyer and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. Retrieved from http:\/\/arxiv.org\/abs\/1907.11692."},{"key":"e_1_3_2_37_2","unstructured":"Mateusz Malinowski and Mario Fritz. 2014. A multi-world approach to question answering about real-world scenes based on uncertain input. Retrieved from http:\/\/arxiv.org\/abs\/1410.0210."},{"key":"e_1_3_2_38_2","first-page":"14111","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Marino Kenneth","year":"2021","unstructured":"Kenneth Marino, Xinlei Chen, Devi Parikh, Abhinav Gupta, and Marcus Rohrbach. 2021. Krisp: Integrating implicit and symbolic knowledge for open-domain knowledge-based VQA. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 
14111\u201314121."},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00331"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/2578726.2578791"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1145\/3458885"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-24794-1_3"},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/d16-1264"},{"key":"e_1_3_2_44_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1410"},{"key":"e_1_3_2_45_2","first-page":"2953","volume-title":"Proceedings of the 28th International Conference on Neural Information Processing Systems","volume":"2","author":"Ren Mengye","year":"2015","unstructured":"Mengye Ren et al. 2015. Exploring models and data for image question answering. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Vol. 2. 2953\u20132961."},{"key":"e_1_3_2_46_2","unstructured":"Victor Sanh Lysandre Debut Julien Chaumond and Thomas Wolf. 2019. DistilBERT a distilled version of BERT: Smaller faster cheaper and lighter. Retrieved from http:\/\/arxiv.org\/abs\/1910.01108."},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1145\/3092832"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33018876"},{"key":"e_1_3_2_49_2","first-page":"10","volume-title":"Proceedings of the COLING Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH\u201916)","author":"Sheng Shurong","year":"2016","unstructured":"Shurong Sheng, Luc Van Gool, and Marie-Francine Moens. 2016. A dataset for multimodal question answering in the cultural heritage domain. In Proceedings of the COLING Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH\u201916). 
ACL, 10\u201317."},{"key":"e_1_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v31i1.11164"},{"key":"e_1_3_2_51_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-30645-8_66"},{"key":"e_1_3_2_52_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1514"},{"key":"e_1_3_2_53_2","first-page":"3703","volume-title":"Proceedings of the IEEE International Conference on Image Processing (ICIP\u201916)","author":"Tan Wei Ren","year":"2016","unstructured":"Wei Ren Tan, Chee Seng Chan, Hern\u00e1n E. Aguirre, and Kiyoshi Tanaka. 2016. Ceci n\u2019est pas une pipe: A deep convolutional network for fine-art paintings classification. In Proceedings of the IEEE International Conference on Image Processing (ICIP\u201916). IEEE, 3703\u20133707."},{"key":"e_1_3_2_54_2","volume-title":"Proceedings of the 12th International Workshop on Image Analysis for Multimedia Interactive Services","author":"Temmermans Frederik","year":"2011","unstructured":"Frederik Temmermans, Bart Jansen, Rudi Deklerck, Peter Schelkens, and Jan Cornelis. 2011. The mobile museum guide: Artwork recognition with eigenpaintings and surf. In Proceedings of the 12th International Workshop on Image Analysis for Multimedia Interactive Services."},{"key":"e_1_3_2_55_2","doi-asserted-by":"publisher","DOI":"10.3390\/s20030779"},{"key":"e_1_3_2_56_2","volume-title":"Robots, Artificial Intelligence, and Service Automation in Travel, Tourism and Hospitality","author":"Virto Nuria Recuero","year":"2019","unstructured":"Nuria Recuero Virto and Maria Francisca Blasco L\u00f3pez. 2019. Robots, artificial intelligence, and service automation to the core: Remastering experiences at museums. In Robots, Artificial Intelligence, and Service Automation in Travel, Tourism and Hospitality. 
Emerald Publishing Limited."},{"key":"e_1_3_2_57_2","doi-asserted-by":"crossref","first-page":"1063","DOI":"10.1145\/2187980.2188242","volume-title":"Proceedings of the 21st International Conference on World Wide Web","author":"Vrande\u010di\u0107 Denny","year":"2012","unstructured":"Denny Vrande\u010di\u0107. 2012. Wikidata: A new platform for collaborative data collection. In Proceedings of the 21st International Conference on World Wide Web. 1063\u20131064."},{"key":"e_1_3_2_58_2","volume-title":"Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI\u201917)","author":"Wang Peng","year":"2017","unstructured":"Peng Wang et al. 2017. Explicit knowledge-based reasoning for visual question answering. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI\u201917)."},{"key":"e_1_3_2_59_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2017.2754246"},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46478-7_28"},{"key":"e_1_3_2_61_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patrec.2022.07.009"},{"key":"e_1_3_2_62_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.10"},{"key":"e_1_3_2_63_2","article-title":"Neural-symbolic vqa: Disentangling reasoning from vision and language understanding","volume":"31","author":"Yi Kexin","year":"2018","unstructured":"Kexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Pushmeet Kohli, and Josh Tenenbaum. 2018. Neural-symbolic vqa: Disentangling reasoning from vision and language understanding. Adv. Neural Info. Process. Syst. 31 (2018).","journal-title":"Adv. Neural Info. Process. Syst."},{"issue":"12","key":"e_1_3_2_64_2","first-page":"4467","article-title":"Multimodal transformer with multi-view visual representation for image captioning","volume":"30","author":"Yu Jun","year":"2019","unstructured":"Jun Yu, Jing Li, Zhou Yu, and Qingming Huang. 2019. 
Multimodal transformer with multi-view visual representation for image captioning. IEEE Trans. Circ. Syst. Video Technol. 30, 12 (2019), 4467\u20134480.","journal-title":"IEEE Trans. Circ. Syst. Video Technol."},{"key":"e_1_3_2_65_2","unstructured":"Licheng Yu Eunbyung Park Alexander C. Berg and Tamara L. Berg. 2015. Visual Madlibs: Fill in the blank image generation and question answering. Retrieved from http:\/\/arxiv.org\/abs\/1506.00278."},{"key":"e_1_3_2_66_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00644"},{"key":"e_1_3_2_67_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.202"},{"key":"e_1_3_2_68_2","doi-asserted-by":"crossref","first-page":"2360","DOI":"10.1145\/3447548.3467285","volume-title":"Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining","author":"Zheng Wenbo","year":"2021","unstructured":"Wenbo Zheng, Lan Yan, Chao Gou, and Fei-Yue Wang. 2021. Knowledge is power: Hierarchical-knowledge embedded meta-learning for visual reasoning in artistic domains. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2360\u20132368."},{"key":"e_1_3_2_69_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.540"},{"key":"e_1_3_2_70_2","doi-asserted-by":"crossref","unstructured":"Zihao Zhu Jing Yu Yujing Wang Yajing Sun Yue Hu and Qi Wu. 2020. Mucko: Multi-layer cross-modal knowledge reasoning for fact-based visual question answering. 
Retrieved from https:\/\/arxiv.org\/abs\/2006.09073.","DOI":"10.24963\/ijcai.2020\/153"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3590773","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3590773","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:36:26Z","timestamp":1750178186000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3590773"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,7,12]]},"references-count":69,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2023,11,30]]}},"alternative-id":["10.1145\/3590773"],"URL":"https:\/\/doi.org\/10.1145\/3590773","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,7,12]]},"assertion":[{"value":"2022-08-05","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-03-26","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-07-12","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}