{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,11]],"date-time":"2026-05-11T22:58:11Z","timestamp":1778540291759,"version":"3.51.4"},"publisher-location":"New York, NY, USA","reference-count":48,"publisher":"ACM","license":[{"start":{"date-parts":[[2020,10,12]],"date-time":"2020-10-12T00:00:00Z","timestamp":1602460800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2020,10,12]]},"DOI":"10.1145\/3394171.3413761","type":"proceedings-article","created":{"date-parts":[[2020,10,12]],"date-time":"2020-10-12T13:10:18Z","timestamp":1602508218000},"page":"2345-2354","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":134,"title":["Medical Visual Question Answering via Conditional Reasoning"],"prefix":"10.1145","author":[{"given":"Li-Ming","family":"Zhan","sequence":"first","affiliation":[{"name":"The Hong Kong Polytechnic University, Hong Kong, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Bo","family":"Liu","sequence":"additional","affiliation":[{"name":"The Hong Kong Polytechnic University, Hong Kong, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Lu","family":"Fan","sequence":"additional","affiliation":[{"name":"The Hong Kong Polytechnic University, Hong Kong, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jiaxin","family":"Chen","sequence":"additional","affiliation":[{"name":"The Hong Kong Polytechnic University, Hong Kong, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiao-Ming","family":"Wu","sequence":"additional","affiliation":[{"name":"The Hong Kong Polytechnic University, Hong Kong, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2020,10,12]]},"reference":[{"key":"e_1_3_2_2_1_1","volume-title":"Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum (CEUR Workshop Proceedings","author":"Abacha Asma Ben","year":"2018","unstructured":"Asma Ben Abacha, Soumya Gayen, Jason J. Lau, Sivaramakrishnan Rajaraman, and Dina Demner-Fushman. 2018. NLM at ImageCLEF 2018 Visual Question Answering in the Medical Domain. In Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum (CEUR Workshop Proceedings, Vol. 2125). CEUR-WS.org, Avignon, France."},{"key":"e_1_3_2_2_2_1","volume-title":"Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum (CEUR Workshop Proceedings","author":"Abacha Asma Ben","year":"2019","unstructured":"Asma Ben Abacha, Sadid A. Hasan, Vivek V. Datla, Joey Liu, Dina Demner-Fushman, and Henning M\u00fcller. 2019. VQA-Med: Overview of the Medical Visual Question Answering Task at ImageCLEF 2019. In Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum (CEUR Workshop Proceedings, Vol. 2380).
CEUR-WS.org, Lugano, Switzerland."},{"key":"e_1_3_2_2_3_1","volume-title":"Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR. IEEE Computer Society","author":"Anderson Peter","year":"2018","unstructured":"Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR. IEEE Computer Society, Salt Lake City, UT, USA, 6077--6086."},{"key":"e_1_3_2_2_4_1","volume-title":"Neural Module Networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR. IEEE Computer Society","author":"Andreas Jacob","year":"2016","unstructured":"Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Neural Module Networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR. IEEE Computer Society, Las Vegas, NV, USA, 39--48."},{"key":"e_1_3_2_2_5_1","volume-title":"VQA: Visual Question Answering. In IEEE International Conference on Computer Vision, ICCV. IEEE Computer Society","author":"Antol Stanislaw","year":"2015","unstructured":"Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question Answering. In IEEE International Conference on Computer Vision, ICCV. IEEE Computer Society, Santiago, Chile, 2425--2433."},
{"key":"e_1_3_2_2_6_1","volume-title":"ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering. arXiv e-prints (Nov","author":"Chen Kan","year":"2015","unstructured":"Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, and Ram Nevatia. 2015. ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering. arXiv e-prints (Nov. 2015), arXiv:1511.05960."},{"key":"e_1_3_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/W14-4012"},{"key":"e_1_3_2_2_8_1","volume-title":"Fast Parameter Adaptation for Few-shot Image Captioning and Visual Question Answering. In 2018 ACM Multimedia Conference on Multimedia Conference, MM. ACM","author":"Dong Xuanyi","year":"2018","unstructured":"Xuanyi Dong, Linchao Zhu, De Zhang, Yi Yang, and Fei Wu. 2018. Fast Parameter Adaptation for Few-shot Image Captioning and Visual Question Answering. In 2018 ACM Multimedia Conference on Multimedia Conference, MM.
ACM, Seoul, Republic of Korea, 54--62."},{"key":"e_1_3_2_2_9_1","volume-title":"Proceedings of the 34th International Conference on Machine Learning, ICML (Proceedings of Machine Learning Research","volume":"70","author":"Finn Chelsea","year":"2017","unstructured":"Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, ICML (Proceedings of Machine Learning Research, Vol. 70). PMLR, Sydney, NSW, Australia, 1126--1135."},{"key":"e_1_3_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1044"},{"key":"e_1_3_2_2_11_1","volume-title":"Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR. IEEE Computer Society","author":"Goyal Yash","year":"2017","unstructured":"Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR.
IEEE Computer Society, Honolulu, HI, USA, 6325--6334."},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_3_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.93"},{"key":"e_1_3_2_2_14_1","volume-title":"Compositional Attention Networks for Machine Reasoning. In 6th International Conference on Learning Representations, ICLR. OpenReview.net","author":"Drew","unstructured":"Drew A. Hudson and Christopher D. Manning. 2018. Compositional Attention Networks for Machine Reasoning. In 6th International Conference on Learning Representations, ICLR. OpenReview.net, Vancouver, BC, Canada."},{"key":"e_1_3_2_2_15_1","volume-title":"GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR. Computer Vision Foundation \/ IEEE","author":"Drew","unstructured":"Drew A. Hudson and Christopher D. Manning. 2019. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR.
Computer Vision Foundation \/ IEEE, Long Beach, CA, USA, 6700--6709."},{"key":"e_1_3_2_2_16_1","volume-title":"CLEF (Lecture Notes in Computer Science","volume":"11018","author":"Ionescu Bogdan","year":"2018","unstructured":"Bogdan Ionescu, Henning M\u00fcller, Mauricio Villegas, Alba Garcia Seco de Herrera, Carsten Eickhoff, Vincent Andrearczyk, Yashin Dicente Cid, Vitali Liauchuk, Vassili Kovalev, Sadid A. Hasan, Yuan Ling, Oladimeji Farri, Joey Liu, Matthew P. Lungren, Duc-Tien Dang-Nguyen, Luca Piras, Michael Riegler, Liting Zhou, Mathias Lux, and Cathal Gurrin. 2018. Overview of ImageCLEF 2018: Challenges, Datasets and Evaluation. In Experimental IR Meets Multilinguality, Multimodality, and Interaction - 9th International Conference of the CLEF Association, CLEF (Lecture Notes in Computer Science, Vol. 11018). Springer, Avignon, France, 309--334."},{"key":"e_1_3_2_2_17_1","volume-title":"Pythia v0.1: the Winning Entry to the VQA Challenge","author":"Jiang Yu","year":"2018","unstructured":"Yu Jiang, Vivek Natarajan, Xinlei Chen, Marcus Rohrbach, Dhruv Batra, and Devi Parikh. 2018.
Pythia v0.1: the Winning Entry to the VQA Challenge 2018. arXiv e-prints (July 2018), arXiv:1807.09956."},{"key":"e_1_3_2_2_18_1","volume-title":"CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR. IEEE Computer Society","author":"Johnson Justin","year":"2017","unstructured":"Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross B. Girshick. 2017. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR. IEEE Computer Society, Honolulu, HI, USA, 1988--1997."},{"key":"e_1_3_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2017.06.005"},{"key":"e_1_3_2_2_20_1","volume-title":"Bilinear Attention Networks. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems, NeurIPS. NeurIPS, Montr\u00e9al, Canada, 1571--1581","author":"Kim Jin-Hwa","year":"2018","unstructured":"Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. 2018. Bilinear Attention Networks. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems, NeurIPS.
NeurIPS, Montr\u00e9al, Canada, 1571--1581."},{"key":"e_1_3_2_2_21_1","volume-title":"Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR, Conference Track Proceedings. OpenReview.net","author":"Diederik","unstructured":"Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR, Conference Track Proceedings. OpenReview.net, San Diego, CA, USA."},{"key":"e_1_3_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.5555\/2886521.2886636"},{"key":"e_1_3_2_2_23_1","volume-title":"Asma Ben Abacha, and Dina Demner-Fushman","author":"Lau Jason J","year":"2018","unstructured":"Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. 2018. A dataset of clinically generated visual questions and answers about radiology images. Scientific data, Vol. 5, 1 (2018), 1--10."},{"key":"e_1_3_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.media.2017.07.005"},{"key":"e_1_3_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3350993"},{"key":"e_1_3_2_2_26_1","volume-title":"Hierarchical Question-Image Co-Attention for Visual Question Answering. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems, NeurIPS","author":"Lu Jiasen","year":"2016","unstructured":"Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical Question-Image Co-Attention for Visual Question Answering. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems, NeurIPS. Barcelona, Spain, 289--297."},
{"key":"e_1_3_2_2_27_1","volume-title":"7th International Conference on Learning Representations, ICLR. OpenReview.net","author":"Mao Jiayuan","year":"2019","unstructured":"Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B. Tenenbaum, and Jiajun Wu. 2019. The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision. In 7th International Conference on Learning Representations, ICLR. OpenReview.net, New Orleans, LA, USA."},{"key":"e_1_3_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00519"},{"key":"e_1_3_2_2_29_1","volume-title":"Proceedings, Part I (Lecture Notes in Computer Science","volume":"6791","author":"Masci Jonathan","year":"2011","unstructured":"Jonathan Masci, Ueli Meier, Dan C. Ciresan, and J\u00fcrgen Schmidhuber. 2011.
Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction. In Artificial Neural Networks and Machine Learning - ICANN 2011 - 21st International Conference on Artificial Neural Networks, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 6791). Springer, Espoo, Finland, 52--59."},{"key":"e_1_3_2_2_30_1","volume-title":"Tran","author":"Nguyen Binh D.","year":"2019","unstructured":"Binh D. Nguyen, Thanh-Toan Do, Binh X. Nguyen, Tuong Do, Erman Tjiputra, and Quang D. Tran. 2019. Overcoming Data Limitation in Medical Visual Question Answering. In Medical Image Computing and Computer Assisted Intervention - MICCAI 2019 - 22nd International Conference, Part IV (Lecture Notes in Computer Science, Vol. 11767). Springer, Shenzhen, China, 522--530."},{"key":"e_1_3_2_2_31_1","volume-title":"High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems, NeurIPS","author":"Paszke Adam","year":"2019","unstructured":"
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas K\u00f6pf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems, NeurIPS. Vancouver, BC, Canada, 8024--8035."},{"key":"e_1_3_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3350925"},{"key":"e_1_3_2_2_33_1","volume-title":"Manning","author":"Pennington Jeffrey","year":"2014","unstructured":"Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP, A meeting of SIGDAT, a Special Interest Group of the ACL. ACL, Doha, Qatar, 1532--1543."},{"key":"e_1_3_2_2_34_1","volume-title":"Courville","author":"Perez Ethan","year":"2018","unstructured":"Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron C. Courville. 2018. FiLM: Visual Reasoning with a General Conditioning Layer. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence.
AAAI Press, New Orleans, Louisiana, USA, 3942--3951."},{"key":"e_1_3_2_2_35_1","volume-title":"Transfusion: Understanding Transfer Learning for Medical Imaging. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems, NeurIPS","author":"Raghu Maithra","year":"2019","unstructured":"Maithra Raghu, Chiyuan Zhang, Jon M. Kleinberg, and Samy Bengio. 2019. Transfusion: Understanding Transfer Learning for Medical Imaging. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems, NeurIPS. Vancouver, BC, Canada, 3342--3352."},{"key":"e_1_3_2_2_36_1","unstructured":"Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems, NeurIPS. Montreal, Quebec, Canada, 91--99."},{"key":"e_1_3_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-1041"},{"key":"e_1_3_2_2_38_1","volume-title":"Deep Multimodal Learning for Medical Visual Question Answering. In Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum (CEUR Workshop Proceedings","author":"Shi Lei","unstructured":"Lei Shi, Feifan Liu, and Max P. Rosen. 2019. Deep Multimodal Learning for Medical Visual Question Answering. In Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum (CEUR Workshop Proceedings, Vol. 2380). CEUR-WS.org, Lugano, Switzerland."},
{"key":"e_1_3_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01072"},{"key":"e_1_3_2_2_40_1","volume-title":"Highway Networks. arXiv e-prints (May","author":"Srivastava Rupesh Kumar","year":"2015","unstructured":"Rupesh Kumar Srivastava, Klaus Greff, and J\u00fcrgen Schmidhuber. 2015. Highway Networks. arXiv e-prints (May 2015), arXiv:1505.00387."},{"key":"e_1_3_2_2_41_1","volume-title":"Proceedings, Part VII (Lecture Notes in Computer Science","volume":"9911","author":"Xu Huijuan","year":"2016","unstructured":"Huijuan Xu and Kate Saenko. 2016. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering. In Computer Vision - ECCV 2016 - 14th European Conference, Proceedings, Part VII (Lecture Notes in Computer Science, Vol. 9911).
Springer, Amsterdam, The Netherlands, 451--466."},{"key":"e_1_3_2_2_42_1","volume-title":"Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum (CEUR Workshop Proceedings","author":"Yan Xin","year":"2019","unstructured":"Xin Yan, Lin Li, Chulin Xie, Jun Xiao, and Lin Gu. 2019. Zhejiang University at ImageCLEF 2019 Visual Question Answering in the Medical Domain. In Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum (CEUR Workshop Proceedings, Vol. 2380). CEUR-WS.org, Lugano, Switzerland."},{"key":"e_1_3_2_2_43_1","volume-title":"Stacked Attention Networks for Image Question Answering. In 2016 IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society","author":"Yang Zichao","unstructured":"Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alexander J. Smola. 2016. Stacked Attention Networks for Image Question Answering. In 2016 IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, Las Vegas, NV, USA, 21--29."},{"key":"e_1_3_2_2_44_1","volume-title":"CLEVRER: Collision Events for Video Representation and Reasoning. In 8th International Conference on Learning Representations, ICLR. OpenReview.net, Addis Ababa, Ethiopia.","author":"Yi Kexin","unstructured":"Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B. Tenenbaum. 2020. CLEVRER: Collision Events for Video Representation and Reasoning. In 8th International Conference on Learning Representations, ICLR. OpenReview.net, Addis Ababa, Ethiopia."},
{"key":"e_1_3_2_2_45_1","volume-title":"Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems, NeurIPS. Montr\u00e9al, Canada, 1039--1050","author":"Yi Kexin","year":"2018","unstructured":"Kexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Pushmeet Kohli, and Josh Tenenbaum. 2018. Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems, NeurIPS. Montr\u00e9al, Canada, 1039--1050."},{"key":"e_1_3_2_2_46_1","volume-title":"Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering. In IEEE International Conference on Computer Vision, ICCV. IEEE Computer Society","author":"Yu Zhou","year":"2017","unstructured":"Zhou Yu, Jun Yu, Jianping Fan, and Dacheng Tao. 2017.
Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering. In IEEE International Conference on Computer Vision, ICCV. IEEE Computer Society, Venice, Italy, 1839--1848."},{"key":"e_1_3_2_2_47_1","volume-title":"Simple Baseline for Visual Question Answering. arXiv e-prints (Dec","author":"Zhou Bolei","year":"2015","unstructured":"Bolei Zhou, Yuandong Tian, Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. 2015. Simple Baseline for Visual Question Answering. arXiv e-prints (Dec. 2015), arXiv:1512.02167."},{"key":"e_1_3_2_2_48_1","volume-title":"Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum (CEUR Workshop Proceedings","author":"Zhou Yangyang","year":"2018","unstructured":"Yangyang Zhou, Xin Kang, and Fuji Ren. 2018. Employing Inception-Resnet-v2 and Bi-LSTM for Medical Domain Visual Question Answering. In Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum (CEUR Workshop Proceedings, Vol. 2125).
CEUR-WS.org, Avignon, France."}],"event":{"name":"MM '20: The 28th ACM International Conference on Multimedia","location":"Seattle WA USA","acronym":"MM '20","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 28th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3394171.3413761","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3394171.3413761","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T22:01:16Z","timestamp":1750197676000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3394171.3413761"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,10,12]]},"references-count":48,"alternative-id":["10.1145\/3394171.3413761","10.1145\/3394171"],"URL":"https:\/\/doi.org\/10.1145\/3394171.3413761","relation":{},"subject":[],"published":{"date-parts":[[2020,10,12]]},"assertion":[{"value":"2020-10-12","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}