{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,7]],"date-time":"2026-04-07T20:29:29Z","timestamp":1775593769867,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":73,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,17]],"date-time":"2022-10-17T00:00:00Z","timestamp":1665964800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,10,17]]},"DOI":"10.1145\/3511808.3557258","type":"proceedings-article","created":{"date-parts":[[2022,10,16]],"date-time":"2022-10-16T01:22:22Z","timestamp":1665883342000},"page":"1996-2006","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["ChiQA"],"prefix":"10.1145","author":[{"given":"Bingning","family":"Wang","sequence":"first","affiliation":[{"name":"Tencent Inc., Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Feiyang","family":"Lv","sequence":"additional","affiliation":[{"name":"Tencent Inc., Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ting","family":"Yao","sequence":"additional","affiliation":[{"name":"Tencent Inc., Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jin","family":"Ma","sequence":"additional","affiliation":[{"name":"Tencent Inc., Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yu","family":"Luo","sequence":"additional","affiliation":[{"name":"Tencent Inc., Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Haijin","family":"Liang","sequence":"additional","affiliation":[{"name":"Tencent Inc., Beijing, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2022,10,17]]},"reference":[{"key":"e_1_3_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00522"},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/3209978.3209985"},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.12"},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.279"},{"key":"e_1_3_2_1_6_1","volume-title":"David M Blei, and Michael I Jordan.","author":"Barnard Kobus","year":"2003","unstructured":"Kobus Barnard , Pinar Duygulu , David Forsyth , Nando De Freitas , David M Blei, and Michael I Jordan. 2003 . Matching words and pictures. (2003). Kobus Barnard, Pinar Duygulu, David Forsyth, Nando De Freitas, David M Blei, and Michael I Jordan. 2003. Matching words and pictures. (2003)."},{"key":"e_1_3_2_1_7_1","volume-title":"Proceedings of the 2013 conference on empirical methods in natural language processing. 1533--1544","author":"Berant Jonathan","year":"2013","unstructured":"Jonathan Berant , Andrew Chou , Roy Frostig , and Percy Liang . 2013 . Semantic parsing on freebase from question-answer pairs . In Proceedings of the 2013 conference on empirical methods in natural language processing. 1533--1544 . Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 conference on empirical methods in natural language processing. 1533--1544."},{"key":"e_1_3_2_1_8_1","volume-title":"A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326","author":"Bowman Samuel R","year":"2015","unstructured":"Samuel R Bowman , Gabor Angeli , Christopher Potts , and Christopher D Manning . 2015. A large annotated corpus for learning natural language inference. 
arXiv preprint arXiv:1508.05326 ( 2015 ). Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326 (2015)."},{"key":"e_1_3_2_1_9_1","unstructured":"Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell etal 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020) 1877--1901.  Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020) 1877--1901."},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/1273496.1273513"},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58577-8_7"},{"key":"e_1_3_2_1_13_1","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","volume":"1","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2019 . BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding . In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , Volume 1 (Long and Short Papers). 4171--4186. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 
In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171--4186."},{"key":"e_1_3_2_1_14_1","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly etal 2021. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR (2021).  Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et al. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR (2021)."},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1346"},{"key":"e_1_3_2_1_16_1","volume-title":"Are you talking to a machine? dataset and methods for multilingual image question. Advances in neural information processing systems 28","author":"Gao Haoyuan","year":"2015","unstructured":"Haoyuan Gao , Junhua Mao , Jie Zhou , Zhiheng Huang , Lei Wang , and Wei Xu. 2015. Are you talking to a machine? dataset and methods for multilingual image question. Advances in neural information processing systems 28 ( 2015 ). Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. 2015. Are you talking to a machine? dataset and methods for multilingual image question. Advances in neural information processing systems 28 (2015)."},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1073\/pnas.1422953112"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.670"},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/1460690.1460714"},{"key":"e_1_3_2_1_20_1","volume-title":"Open-domain question answering. Ph. D. Dissertation","author":"Greenwood Mark Andrew","unstructured":"Mark Andrew Greenwood . 2005. 
Open-domain question answering. Ph. D. Dissertation . University of Sheffield , UK. Mark Andrew Greenwood. 2005. Open-domain question answering. Ph. D. Dissertation. University of Sheffield, UK."},{"key":"e_1_3_2_1_21_1","volume-title":"Wukong: 100 Million Large- scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework. arXiv preprint arXiv:2202.06767","author":"Gu Jiaxi","year":"2022","unstructured":"Jiaxi Gu , Xiaojun Meng , Guansong Lu , Lu Hou , Minzhe Niu , Hang Xu , Xiaodan Liang , Wei Zhang , Xin Jiang , and Chunjing Xu. 2022. Wukong: 100 Million Large- scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework. arXiv preprint arXiv:2202.06767 ( 2022 ). Jiaxi Gu, Xiaojun Meng, Guansong Lu, Lu Hou, Minzhe Niu, Hang Xu, Xiaodan Liang, Wei Zhang, Xin Jiang, and Chunjing Xu. 2022. Wukong: 100 Million Large- scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework. arXiv preprint arXiv:2202.06767 (2022)."},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W18-2605"},{"key":"e_1_3_2_1_24_1","volume-title":"Long Short-Term Memory. Neural computation 9 8","author":"Hochreiter Sepp","year":"1997","unstructured":"Sepp Hochreiter and J\u00fcrgen Schmidhuber . 1997. Long Short-Term Memory. Neural computation 9 8 ( 1997 ), 1735--80. Sepp Hochreiter and J\u00fcrgen Schmidhuber. 1997. Long Short-Term Memory. Neural computation 9 8 (1997), 1735--80."},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475452"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00147"},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.naacl-main.195"},{"key":"e_1_3_2_1_28_1","volume-title":"Compositional attention networks for machine reasoning. 
arXiv preprint arXiv:1803.03067","author":"Hudson Drew A","year":"2018","unstructured":"Drew A Hudson and Christopher D Manning . 2018. Compositional attention networks for machine reasoning. arXiv preprint arXiv:1803.03067 ( 2018 ). Drew A Hudson and Christopher D Manning. 2018. Compositional attention networks for machine reasoning. arXiv preprint arXiv:1803.03067 (2018)."},{"key":"e_1_3_2_1_29_1","unstructured":"Drew A Hudson and Christopher D Manning. 2019. GQA: a new dataset for compositional question answering over real-world images. (2019).  Drew A Hudson and Christopher D Manning. 2019. GQA: a new dataset for compositional question answering over real-world images. (2019)."},{"key":"e_1_3_2_1_30_1","unstructured":"Yuqi Huo Manli Zhang Guangzhen Liu Haoyu Lu Yizhao Gao Guoxing Yang Jingyuan Wen Heng Zhang Baogui Xu Weihao Zheng etal 2021. WenLan: Bridging vision and language by large-scale multi-modal pre-training. arXiv preprint arXiv:2103.06561 (2021).  Yuqi Huo Manli Zhang Guangzhen Liu Haoyu Lu Yizhao Gao Guoxing Yang Jingyuan Wen Heng Zhang Baogui Xu Weihao Zheng et al. 2021. WenLan: Bridging vision and language by large-scale multi-modal pre-training. arXiv preprint arXiv:2103.06561 (2021)."},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D17-1215"},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.215"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"e_1_3_2_1_34_1","volume-title":"Kingma and Jimmy Ba","author":"Diederik","year":"2014","unstructured":"Diederik P. Kingma and Jimmy Ba . 2014 . Adam : A Method for Stochastic Optimization. CoRR abs\/1412.6980 (2014). Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. 
CoRR abs\/1412.6980 (2014)."},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"crossref","unstructured":"Ranjay Krishna Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen Yannis Kalantidis Li-Jia Li David A Shamma etal 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision 123 1 (2017) 32--73.  Ranjay Krishna Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen Yannis Kalantidis Li-Jia Li David A Shamma et al. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision 123 1 (2017) 32--73.","DOI":"10.1007\/s11263-016-0981-7"},{"key":"e_1_3_2_1_36_1","volume-title":"System and technology advancements in distance learning. IGI Global (701 E. Chocolate Avenue","author":"Kumar Vivek","unstructured":"Vivek Kumar and Fuhua Lin . 2013. System and technology advancements in distance learning. IGI Global (701 E. Chocolate Avenue , Hershey, Pennsylvania , 17033, USA) . Vivek Kumar and Fuhua Lin. 2013. System and technology advancements in distance learning. IGI Global (701 E. Chocolate Avenue, Hershey, Pennsylvania, 17033, USA)."},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00276"},{"key":"e_1_3_2_1_38_1","volume-title":"Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems 34","author":"Li Junnan","year":"2021","unstructured":"Junnan Li , Ramprasaath Selvaraju , Akhilesh Gotmare , Shafiq Joty , Caiming Xiong , and Steven Chu Hong Hoi . 2021. Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems 34 ( 2021 ). Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. 2021. 
Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems 34 (2021)."},{"key":"e_1_3_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58577-8_8"},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1093\/screen\/17.3.7"},{"key":"e_1_3_2_1_42_1","volume-title":"A multi-world approach to question answering about real-world scenes based on uncertain input. Advances in neural information processing systems 27","author":"Malinowski Mateusz","year":"2014","unstructured":"Mateusz Malinowski and Mario Fritz . 2014. A multi-world approach to question answering about real-world scenes based on uncertain input. Advances in neural information processing systems 27 ( 2014 ). Mateusz Malinowski and Mario Fritz. 2014. A multi-world approach to question answering about real-world scenes based on uncertain input. Advances in neural information processing systems 27 (2014)."},{"key":"e_1_3_2_1_43_1","volume-title":"Proceedings of the 27th International Conference on Computational Linguistics. 973--983","author":"Nakanishi Mao","year":"2018","unstructured":"Mao Nakanishi , Tetsunori Kobayashi , and Yoshihiko Hayashi . 2018 . Answerable or not: Devising a dataset for extending machine reading comprehension . In Proceedings of the 27th International Conference on Computational Linguistics. 973--983 . Mao Nakanishi, Tetsunori Kobayashi, and Yoshihiko Hayashi. 2018. Answerable or not: Devising a dataset for extending machine reading comprehension. In Proceedings of the 27th International Conference on Computational Linguistics. 973--983."},{"key":"e_1_3_2_1_44_1","volume-title":"MS MARCO: A human generated machine reading comprehension dataset. 
In CoCo@ NIPS.","author":"Nguyen Tri","year":"2016","unstructured":"Tri Nguyen , Mir Rosenberg , Xia Song , Jianfeng Gao , Saurabh Tiwary , Rangan Majumder , and Li Deng . 2016 . MS MARCO: A human generated machine reading comprehension dataset. In CoCo@ NIPS. Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. In CoCo@ NIPS."},{"key":"e_1_3_2_1_45_1","volume-title":"Emily M Bender, Emily Denton, and Alex Hanna.","author":"Paullada Amandalynne","year":"2021","unstructured":"Amandalynne Paullada , Inioluwa Deborah Raji , Emily M Bender, Emily Denton, and Alex Hanna. 2021 . Data and its (dis) contents: A survey of dataset development and use in machine learning research. Patterns 2, 11 (2021). Amandalynne Paullada, Inioluwa Deborah Raji, Emily M Bender, Emily Denton, and Alex Hanna. 2021. Data and its (dis) contents: A survey of dataset development and use in machine learning research. Patterns 2, 11 (2021)."},{"key":"e_1_3_2_1_46_1","unstructured":"Anthony Valiant Phillips. 1960. Artificial Intelligence Project-RLE and MIT Computation Center Memo 16-A Question-Answering Routine'. (1960).  Anthony Valiant Phillips. 1960. Artificial Intelligence Project-RLE and MIT Computation Center Memo 16-A Question-Answering Routine'. (1960)."},{"key":"e_1_3_2_1_47_1","volume-title":"Imagebert: Cross-modal pre-training with large-scale weak-supervised image-text data. arXiv preprint arXiv:2001.07966","author":"Qi Di","year":"2020","unstructured":"Di Qi , Lin Su , Jia Song , Edward Cui , Taroon Bharti , and Arun Sacheti . 2020 . Imagebert: Cross-modal pre-training with large-scale weak-supervised image-text data. arXiv preprint arXiv:2001.07966 (2020). Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, and Arun Sacheti. 2020. Imagebert: Cross-modal pre-training with large-scale weak-supervised image-text data. 
arXiv preprint arXiv:2001.07966 (2020)."},{"key":"e_1_3_2_1_48_1","volume-title":"International Conference on Machine Learning. PMLR, 8748--8763","author":"Radford Alec","year":"2021","unstructured":"Alec Radford , Jong Wook Kim , Chris Hallacy , Aditya Ramesh , Gabriel Goh , Sandhini Agarwal , Girish Sastry , Amanda Askell , Pamela Mishkin , Jack Clark , 2021 . Learning transferable visual models from natural language supervision . In International Conference on Machine Learning. PMLR, 8748--8763 . Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748--8763."},{"key":"e_1_3_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1264"},{"key":"e_1_3_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.131"},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV51458.2022.00257"},{"key":"e_1_3_2_1_52_1","doi-asserted-by":"crossref","unstructured":"Amanpreet Singh Vivek Natarajan Meet Shah Yu Jiang Xinlei Chen Dhruv Batra Devi Parikh and Marcus Rohrbach. 2019. Towards vqa models that can read. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 8317--8326.  Amanpreet Singh Vivek Natarajan Meet Shah Yu Jiang Xinlei Chen Dhruv Batra Devi Parikh and Marcus Rohrbach. 2019. Towards vqa models that can read. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 8317--8326.","DOI":"10.1109\/CVPR.2019.00851"},{"key":"e_1_3_2_1_53_1","volume-title":"Vl-bert: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530","author":"Su Weijie","year":"2019","unstructured":"Weijie Su , Xizhou Zhu , Yue Cao , Bin Li , Lewei Lu , Furu Wei , and Jifeng Dai . 2019 . 
Vl-bert: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530 (2019). Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2019. Vl-bert: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530 (2019)."},{"key":"e_1_3_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W16-6001"},{"key":"e_1_3_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1453"},{"key":"e_1_3_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P17-1075"},{"key":"e_1_3_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1644"},{"key":"e_1_3_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.746"},{"key":"e_1_3_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-2071"},{"key":"e_1_3_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1514"},{"key":"e_1_3_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.501"},{"key":"e_1_3_2_1_62_1","volume-title":"Show and tell: Lessons learned from the 2015 mscoco image captioning challenge","author":"Vinyals Oriol","year":"2016","unstructured":"Oriol Vinyals , Alexander Toshev , Samy Bengio , and Dumitru Erhan . 2016. Show and tell: Lessons learned from the 2015 mscoco image captioning challenge . IEEE transactions on pattern analysis and machine intelligence 39, 4 ( 2016 ), 652--663. Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2016. Show and tell: Lessons learned from the 2015 mscoco image captioning challenge. IEEE transactions on pattern analysis and machine intelligence 39, 4 (2016), 652--663."},{"key":"e_1_3_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1145\/3442381.3449993"},{"key":"e_1_3_2_1_64_1","doi-asserted-by":"crossref","unstructured":"Peng Wang Qi Wu Chunhua Shen Anthony R Dick and Anton van den Hengel. 2017. Explicit Knowledge-based Reasoning for Visual Question Answering. In IJCAI.  
Peng Wang Qi Wu Chunhua Shen Anthony R Dick and Anton van den Hengel. 2017. Explicit Knowledge-based Reasoning for Visual Question Answering. In IJCAI.","DOI":"10.24963\/ijcai.2017\/179"},{"key":"e_1_3_2_1_65_1","doi-asserted-by":"crossref","unstructured":"Thomas Wolf Lysandre Debut Victor Sanh Julien Chaumond Clement Delangue Anthony Moi Pierric Cistac Tim Rault R\u00e9mi Louf Morgan Funtowicz etal 2019. Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019).  Thomas Wolf Lysandre Debut Victor Sanh Julien Chaumond Clement Delangue Anthony Moi Pierric Cistac Tim Rault R\u00e9mi Louf Morgan Funtowicz et al. 2019. Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019).","DOI":"10.18653\/v1\/2020.emnlp-demos.6"},{"key":"e_1_3_2_1_66_1","volume-title":"Visual entailment: A novel task for fine-grained image understanding. arXiv preprint arXiv:1901.06706","author":"Xie Ning","year":"2019","unstructured":"Ning Xie , Farley Lai , Derek Doran , and Asim Kadav . 2019. Visual entailment: A novel task for fine-grained image understanding. arXiv preprint arXiv:1901.06706 ( 2019 ). Ning Xie, Farley Lai, Derek Doran, and Asim Kadav. 2019. Visual entailment: A novel task for fine-grained image understanding. arXiv preprint arXiv:1901.06706 (2019)."},{"key":"e_1_3_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D15-1237"},{"key":"e_1_3_2_1_68_1","volume-title":"Neural-symbolic vqa: Disentangling reasoning from vision and language understanding. Advances in neural information processing systems 31","author":"Yi Kexin","year":"2018","unstructured":"Kexin Yi , Jiajun Wu , Chuang Gan , Antonio Torralba , Pushmeet Kohli , and Josh Tenenbaum . 2018. Neural-symbolic vqa: Disentangling reasoning from vision and language understanding. Advances in neural information processing systems 31 ( 2018 ). 
Kexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Pushmeet Kohli, and Josh Tenenbaum. 2018. Neural-symbolic vqa: Disentangling reasoning from vision and language understanding. Advances in neural information processing systems 31 (2018)."},{"key":"e_1_3_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00166"},{"key":"e_1_3_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.7005"},{"key":"e_1_3_2_1_71_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.540"},{"key":"e_1_3_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.1145\/2783258.2783316"},{"key":"e_1_3_2_1_73_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2013.211"}],"event":{"name":"CIKM '22: The 31st ACM International Conference on Information and Knowledge Management","location":"Atlanta GA USA","acronym":"CIKM '22","sponsor":["SIGWEB ACM Special Interest Group on Hypertext, Hypermedia, and Web","SIGIR ACM Special Interest Group on Information Retrieval"]},"container-title":["Proceedings of the 31st ACM International Conference on Information & Knowledge Management"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3511808.3557258","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3511808.3557258","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:49:07Z","timestamp":1750182547000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3511808.3557258"}},"subtitle":["A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal 
Understanding"],"short-title":[],"issued":{"date-parts":[[2022,10,17]]},"references-count":73,"alternative-id":["10.1145\/3511808.3557258","10.1145\/3511808"],"URL":"https:\/\/doi.org\/10.1145\/3511808.3557258","relation":{},"subject":[],"published":{"date-parts":[[2022,10,17]]},"assertion":[{"value":"2022-10-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}