{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,9]],"date-time":"2025-11-09T07:47:09Z","timestamp":1762674429718,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":23,"publisher":"ACM","license":[{"start":{"date-parts":[[2020,7,25]],"date-time":"2020-07-25T00:00:00Z","timestamp":1595635200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Key Research and Development Program of China","award":["2018AAA0100204"],"award-info":[{"award-number":["2018AAA0100204"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2020,7,25]]},"DOI":"10.1145\/3397271.3401247","type":"proceedings-article","created":{"date-parts":[[2020,7,25]],"date-time":"2020-07-25T07:50:08Z","timestamp":1595663408000},"page":"1781-1784","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Multi-Level Multimodal Transformer Network for Multimodal Recipe Comprehension"],"prefix":"10.1145","author":[{"given":"Ao","family":"Liu","sequence":"first","affiliation":[{"name":"University of Electronic Science and Technology of China, Chengdu, China"}]},{"given":"Shuai","family":"Yuan","sequence":"additional","affiliation":[{"name":"University of Electronic Science and Technology of China, Chengdu, China"}]},{"given":"Chenbin","family":"Zhang","sequence":"additional","affiliation":[{"name":"University of Electronic Science and Technology of China &amp; Tencent, Chengdu, China"}]},{"given":"Congjian","family":"Luo","sequence":"additional","affiliation":[{"name":"University of Electronic Science and Technology of China &amp; Peng Cheng Lab, Chengdu, China"}]},{"given":"Yaqing","family":"Liao","sequence":"additional","affiliation":[{"name":"University of Electronic Science 
and Technology of China, Chengdu, China"}]},{"given":"Kun","family":"Bai","sequence":"additional","affiliation":[{"name":"Tencent, Beijing, China"}]},{"given":"Zenglin","family":"Xu","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology, Peng Cheng Lab, & University of Electronic Science and Technology of China, Shenzhen, China"}]}],"member":"320","published-online":{"date-parts":[[2020,7,25]]},"reference":[{"key":"e_1_3_2_2_1_1","volume-title":"Vqa: Visual question answering. In ICCV. 2425--2433.","author":"Antol Stanislaw","year":"2015","unstructured":"Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual question answering. In ICCV. 2425--2433."},{"key":"e_1_3_2_2_2_1","volume-title":"Layer normalization. arXiv preprint arXiv:1607.06450","author":"Ba Jimmy Lei","year":"2016","unstructured":"Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016)."},{"key":"e_1_3_2_2_3_1","doi-asserted-by":"crossref","unstructured":"Micael Carvalho, R\u00e9mi Cad\u00e8ne, David Picard, Laure Soulier, Nicolas Thome, and Matthieu Cord. 2018. Cross-modal retrieval in the cooking context: Learning semantic text-image embeddings. In SIGIR. 35--44.","DOI":"10.1145\/3209978.3210036"},{"key":"e_1_3_2_2_4_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. 
arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_3_2_2_5_1","unstructured":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR. 770--778."},{"key":"e_1_3_2_2_6_1","volume-title":"Long short-term memory. Neural computation","author":"Hochreiter Sepp","year":"1997","unstructured":"Sepp Hochreiter and J\u00fcrgen Schmidhuber. 1997. Long short-term memory. Neural computation, Vol. 9, 8 (1997), 1735--1780."},{"key":"e_1_3_2_2_7_1","doi-asserted-by":"crossref","unstructured":"Mohit Iyyer, Varun Manjunatha, Anupam Guha, Yogarshi Vyas, Jordan Boyd-Graber, Hal Daume, and Larry S Davis. 2017. The amazing mysteries of the gutter: Drawing inferences between panels in comic book narratives. In CVPR. 7186--7195.","DOI":"10.1109\/CVPR.2017.686"},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"crossref","unstructured":"Aniruddha Kembhavi, Minjoon Seo, Dustin Schwenk, Jonghyun Choi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension. 
In CVPR. 4999--5007.","DOI":"10.1109\/CVPR.2017.571"},{"key":"e_1_3_2_2_9_1","volume-title":"Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980","author":"Kingma Diederik P","year":"2014","unstructured":"Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)."},{"key":"e_1_3_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-016-0981-7"},{"key":"e_1_3_2_2_11_1","volume-title":"Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557","author":"Li Liunian Harold","year":"2019","unstructured":"Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019)."},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"crossref","unstructured":"Ao Liu, Lizhen Qu, Junyu Lu, Chenbin Zhang, and Zenglin Xu. 2019. Machine Reading Comprehension: Matching and Orders. In CIKM. 2057--2060.","DOI":"10.1145\/3357384.3358139"},{"key":"e_1_3_2_2_13_1","unstructured":"Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019a. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems. 13--23."},{"key":"e_1_3_2_2_14_1","volume-title":"Constructing Interpretive Spatio-Temporal Features for Multi-Turn Responses Selection. In ACL'19","author":"Lu Junyu","year":"2019","unstructured":"Junyu Lu, Chenbin Zhang, Zeying Xie, Guang Ling, Tom Chao Zhou, and Zenglin Xu. 2019b. Constructing Interpretive Spatio-Temporal Features for Multi-Turn Responses Selection. In ACL'19. 44--50."},{"key":"e_1_3_2_2_15_1","volume-title":"Glove: Global vectors for word representation. In EMNLP. 1532--1543.","author":"Pennington Jeffrey","year":"2014","unstructured":"Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In EMNLP. 1532--1543."},{"key":"e_1_3_2_2_16_1","volume-title":"Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250","author":"Rajpurkar Pranav","year":"2016","unstructured":"Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016)."},{"key":"e_1_3_2_2_17_1","doi-asserted-by":"crossref","unstructured":"Amaia Salvador, Nicholas Hynes, Yusuf Aytar, Javier Marin, Ferda Ofli, Ingmar Weber, and Antonio Torralba. 2017. Learning cross-modal embeddings for cooking recipes and food images. In CVPR. 
3020--3028.","DOI":"10.1109\/CVPR.2017.327"},{"key":"e_1_3_2_2_18_1","volume-title":"Videobert: A joint model for video and language representation learning. In ICCV. 7464--7473.","author":"Sun Chen","year":"2019","unstructured":"Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019. Videobert: A joint model for video and language representation learning. In ICCV. 7464--7473."},{"key":"e_1_3_2_2_19_1","volume-title":"Movieqa: Understanding stories in movies through question-answering. In CVPR. 4631--4640.","author":"Tapaswi Makarand","year":"2016","unstructured":"Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2016. Movieqa: Understanding stories in movies through question-answering. In CVPR. 4631--4640."},{"key":"e_1_3_2_2_20_1","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998--6008."},{"key":"e_1_3_2_2_21_1","doi-asserted-by":"crossref","unstructured":"Semih Yagcioglu, Aykut Erdem, Erkut Erdem, and Nazli Ikizler-Cinbis. 2018. 
RecipeQA: A Challenge Dataset for Multimodal Comprehension of Cooking Recipes. In EMNLP. 1358--1368.","DOI":"10.18653\/v1\/D18-1166"},{"key":"e_1_3_2_2_22_1","volume-title":"Multimodal Transformer with Multi-View Visual Representation for Image Captioning. arXiv preprint arXiv:1905.07841","author":"Yu Jun","year":"2019","unstructured":"Jun Yu, Jing Li, Zhou Yu, and Qingming Huang. 2019. Multimodal Transformer with Multi-View Visual Representation for Image Captioning. arXiv preprint arXiv:1905.07841 (2019)."},{"key":"e_1_3_2_2_23_1","doi-asserted-by":"crossref","unstructured":"Chenbin Zhang, Congjian Luo, Junyu Lu, Ao Liu, Bing Bai, Kun Bai, and Zenglin Xu. 2020. Read Attend and Exclude: Multi-Choice Reading Comprehension by Mimicking Human Reasoning Process. In SIGIR.","DOI":"10.1145\/3397271.3401326"}],"event":{"name":"SIGIR '20: The 43rd International ACM SIGIR conference on research and development in Information Retrieval","sponsor":["SIGIR ACM Special Interest Group on Information Retrieval"],"location":"Virtual Event China","acronym":"SIGIR '20"},"container-title":["Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3397271.3401247","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3397271.3401247","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T22:41:44Z","timestamp":1750200104000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3397271.3401247"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,7,25]]},"references-count":23,"alternative-id":["10.1145\/3397271.3401247","10.1145\/3397271"],"URL":"https:\/\/doi.org\/10.1145\/3397271.3401247","relation":{},"subject":[],"published":{"date-parts":[[2020,7,25]]},"assertion":[{"value":"2020-07-25","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}