{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,6]],"date-time":"2025-12-06T17:17:17Z","timestamp":1765041437910,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":39,"publisher":"ACM","license":[{"start":{"date-parts":[[2023,12,6]],"date-time":"2023-12-06T00:00:00Z","timestamp":1701820800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"the National Natural Science Foundation of China","award":["No. 62072136"],"award-info":[{"award-number":["No. 62072136"]}]},{"name":"the National Key R&D Program of China","award":["No. 2020YFB1710200"],"award-info":[{"award-number":["No. 2020YFB1710200"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,12,6]]},"DOI":"10.1145\/3595916.3626389","type":"proceedings-article","created":{"date-parts":[[2024,1,1]],"date-time":"2024-01-01T16:34:41Z","timestamp":1704126881000},"page":"1-7","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["Cross-modal Image-Recipe Retrieval via Multimodal Fusion"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0003-6380-6714","authenticated-orcid":false,"given":"Lijie","family":"Li","sequence":"first","affiliation":[{"name":"Harbin Engineering University, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-5891-6267","authenticated-orcid":false,"given":"Caiyue","family":"Hu","sequence":"additional","affiliation":[{"name":"Harbin Engineering University, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0291-1372","authenticated-orcid":false,"given":"Haitao","family":"Zhang","sequence":"additional","affiliation":[{"name":"Harbin Engineering University, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5208-7043","authenticated-orcid":false,"given":"Akshita","family":"Maradapu Vera Venkata sai","sequence":"additional","affiliation":[{"name":"Towson University, USA"}]}],"member":"320","published-online":{"date-parts":[[2024,1]]},"reference":[{"key":"e_1_3_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.279"},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/3209978.3210036"},{"key":"e_1_3_2_1_3_1","volume-title":"MultiMedia Modeling: 23rd International Conference, MMM 2017, Reykjavik, Iceland, January 4-6, 2017, Proceedings, Part I 23","author":"Chen Jingjing","year":"2017","unstructured":"Jingjing Chen , Lei Pang , and Chong-Wah Ngo . 2017 . Cross-modal recipe retrieval: How to cook this dish? . In MultiMedia Modeling: 23rd International Conference, MMM 2017, Reykjavik, Iceland, January 4-6, 2017, Proceedings, Part I 23 . Springer, 588\u2013600. Jingjing Chen, Lei Pang, and Chong-Wah Ngo. 2017. Cross-modal recipe retrieval: How to cook this dish?. In MultiMedia Modeling: 23rd International Conference, MMM 2017, Reykjavik, Iceland, January 4-6, 2017, Proceedings, Part I 23. Springer, 588\u2013600."},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3240508.3240627"},{"key":"e_1_3_2_1_5_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2018 . Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_3_2_1_6_1","volume-title":"An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929","author":"Dosovitskiy Alexey","year":"2020","unstructured":"Alexey Dosovitskiy , Lucas Beyer , Alexander Kolesnikov , Dirk Weissenborn , Xiaohua Zhai , Thomas Unterthiner , Mostafa Dehghani , Matthias Minderer , Georg Heigold , Sylvain Gelly , 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 ( 2020 ). Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)."},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/3347448.3357163"},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01458"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_1_10_1","volume-title":"Long short-term memory. Neural computation 9, 8","author":"Hochreiter Sepp","year":"1997","unstructured":"Sepp Hochreiter and J\u00fcrgen Schmidhuber . 1997. Long short-term memory. Neural computation 9, 8 ( 1997 ), 1735\u20131780. Sepp Hochreiter and J\u00fcrgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735\u20131780."},{"volume-title":"Breakthroughs in statistics: methodology and distribution","author":"Hotelling Harold","key":"e_1_3_2_1_11_1","unstructured":"Harold Hotelling . 1992. Relations between two sets of variates . In Breakthroughs in statistics: methodology and distribution . Springer , 162\u2013190. Harold Hotelling. 1992. Relations between two sets of variates. In Breakthroughs in statistics: methodology and distribution. Springer, 162\u2013190."},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"e_1_3_2_1_13_1","volume-title":"Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980","author":"Kingma P","year":"2014","unstructured":"Diederik\u00a0 P Kingma and Jimmy Ba . 2014 . Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014). Diederik\u00a0P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)."},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01225-0_13"},{"key":"e_1_3_2_1_15_1","volume-title":"International Conference on Machine Learning. PMLR, 12888\u201312900","author":"Li Junnan","year":"2022","unstructured":"Junnan Li , Dongxu Li , Caiming Xiong , and Steven Hoi . 2022 . Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation . In International Conference on Machine Learning. PMLR, 12888\u201312900 . Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning. PMLR, 12888\u201312900."},{"key":"e_1_3_2_1_16_1","volume-title":"Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems 34","author":"Li Junnan","year":"2021","unstructured":"Junnan Li , Ramprasaath Selvaraju , Akhilesh Gotmare , Shafiq Joty , Caiming Xiong , and Steven Chu\u00a0Hong Hoi . 2021. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems 34 ( 2021 ), 9694\u20139705. Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu\u00a0Hong Hoi. 2021. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems 34 (2021), 9694\u20139705."},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3460426.3463618"},{"key":"e_1_3_2_1_18_1","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision. 4654\u20134662","author":"Li Kunpeng","year":"2019","unstructured":"Kunpeng Li , Yulun Zhang , Kai Li , Yuanyuan Li , and Yun Fu . 2019 . Visual semantic reasoning for image-text matching . In Proceedings of the IEEE\/CVF international conference on computer vision. 4654\u20134662 . Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, and Yun Fu. 2019. Visual semantic reasoning for image-text matching. In Proceedings of the IEEE\/CVF international conference on computer vision. 4654\u20134662."},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"crossref","unstructured":"Lijie Li Shuangyang Hu Junhao Chen Ye Wang and Zuobin Xiong. 2022. Exp-SoftLexicon Lattice Model Integrating Radical-Level Features for Chinese NER.. In SEKE. 329\u2013334.  Lijie Li Shuangyang Hu Junhao Chen Ye Wang and Zuobin Xiong. 2022. Exp-SoftLexicon Lattice Model Integrating Radical-Level Features for Chinese NER.. In SEKE. 329\u2013334.","DOI":"10.18293\/SEKE2022-037"},{"key":"e_1_3_2_1_20_1","volume-title":"Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557","author":"Li Liunian\u00a0Harold","year":"2019","unstructured":"Liunian\u00a0Harold Li , Mark Yatskar , Da Yin , Cho-Jui Hsieh , and Kai-Wei Chang . 2019 . Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019). Liunian\u00a0Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019)."},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2020.2969808"},{"key":"e_1_3_2_1_22_1","volume-title":"Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781","author":"Mikolov Tomas","year":"2013","unstructured":"Tomas Mikolov , Kai Chen , Greg Corrado , and Jeffrey Dean . 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 ( 2013 ). Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)."},{"key":"e_1_3_2_1_23_1","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence, Vol.\u00a035","author":"Pham X","year":"2021","unstructured":"Hai\u00a0 X Pham , Ricardo Guerrero , Vladimir Pavlovic , and Jiatong Li . 2021 . CHEF: cross-modal hierarchical embeddings for food domain retrieval . In Proceedings of the AAAI Conference on Artificial Intelligence, Vol.\u00a035 . 2423\u20132430. Hai\u00a0X Pham, Ricardo Guerrero, Vladimir Pavlovic, and Jiatong Li. 2021. CHEF: cross-modal hierarchical embeddings for food domain retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol.\u00a035. 2423\u20132430."},{"key":"e_1_3_2_1_24_1","volume-title":"Imagebert: Cross-modal pre-training with large-scale weak-supervised image-text data. arXiv preprint arXiv:2001.07966","author":"Qi Di","year":"2020","unstructured":"Di Qi , Lin Su , Jia Song , Edward Cui , Taroon Bharti , and Arun Sacheti . 2020 . Imagebert: Cross-modal pre-training with large-scale weak-supervised image-text data. arXiv preprint arXiv:2001.07966 (2020). Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, and Arun Sacheti. 2020. Imagebert: Cross-modal pre-training with large-scale weak-supervised image-text data. arXiv preprint arXiv:2001.07966 (2020)."},{"key":"e_1_3_2_1_25_1","volume-title":"International conference on machine learning. PMLR, 8748\u20138763","author":"Radford Alec","year":"2021","unstructured":"Alec Radford , Jong\u00a0Wook Kim , Chris Hallacy , Aditya Ramesh , Gabriel Goh , Sandhini Agarwal , Girish Sastry , Amanda Askell , Pamela Mishkin , Jack Clark , 2021 . Learning transferable visual models from natural language supervision . In International conference on machine learning. PMLR, 8748\u20138763 . Alec Radford, Jong\u00a0Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748\u20138763."},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01522"},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.327"},{"key":"e_1_3_2_1_28_1","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 4567\u20134578","author":"Shukor Mustafa","year":"2022","unstructured":"Mustafa Shukor , Guillaume Couairon , Asya Grechka , and Matthieu Cord . 2022 . Transformer decoders with multimodal regularization for cross-modal food retrieval . In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 4567\u20134578 . Mustafa Shukor, Guillaume Couairon, Asya Grechka, and Matthieu Cord. 2022. Transformer decoders with multimodal regularization for cross-modal food retrieval. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 4567\u20134578."},{"key":"e_1_3_2_1_29_1","volume-title":"Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman . 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 ( 2014 ). Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)."},{"key":"e_1_3_2_1_30_1","volume-title":"Vl-bert: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530","author":"Su Weijie","year":"2019","unstructured":"Weijie Su , Xizhou Zhu , Yue Cao , Bin Li , Lewei Lu , Furu Wei , and Jifeng Dai . 2019 . Vl-bert: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530 (2019). Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2019. Vl-bert: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530 (2019)."},{"key":"e_1_3_2_1_31_1","volume-title":"A corpus for reasoning about natural language grounded in photographs. arXiv preprint arXiv:1811.00491","author":"Suhr Alane","year":"2018","unstructured":"Alane Suhr , Stephanie Zhou , Ally Zhang , Iris Zhang , Huajun Bai , and Yoav Artzi . 2018. A corpus for reasoning about natural language grounded in photographs. arXiv preprint arXiv:1811.00491 ( 2018 ). Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. 2018. A corpus for reasoning about natural language grounded in photographs. arXiv preprint arXiv:1811.00491 (2018)."},{"key":"e_1_3_2_1_32_1","volume-title":"Attention is all you need. Advances in neural information processing systems 30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani , Noam Shazeer , Niki Parmar , Jakob Uszkoreit , Llion Jones , Aidan\u00a0 N Gomez , \u0141ukasz Kaiser , and Illia Polosukhin . 2017. Attention is all you need. Advances in neural information processing systems 30 ( 2017 ). Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan\u00a0N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017)."},{"key":"e_1_3_2_1_33_1","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition. 11572\u201311581","author":"Wang Hao","year":"2019","unstructured":"Hao Wang , Doyen Sahoo , Chenghao Liu , Ee-peng Lim, and Steven\u00a0 CH Hoi . 2019 . Learning cross-modal embeddings with adversarial networks for cooking recipes and food images . In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition. 11572\u201311581 . Hao Wang, Doyen Sahoo, Chenghao Liu, Ee-peng Lim, and Steven\u00a0CH Hoi. 2019. Learning cross-modal embeddings with adversarial networks for cooking recipes and food images. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition. 11572\u201311581."},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"crossref","first-page":"2515","DOI":"10.1109\/TMM.2021.3083109","article-title":"Cross-modal food retrieval: learning a joint embedding of food images and recipes with semantic consistency and attention mechanism","volume":"24","author":"Wang Hao","year":"2021","unstructured":"Hao Wang , Doyen Sahoo , Chenghao Liu , Ke Shu , Palakorn Achananuparp , Ee-peng Lim, and Steven\u00a0 CH Hoi . 2021 . Cross-modal food retrieval: learning a joint embedding of food images and recipes with semantic consistency and attention mechanism . IEEE Transactions on Multimedia 24 (2021), 2515 \u2013 2525 . Hao Wang, Doyen Sahoo, Chenghao Liu, Ke Shu, Palakorn Achananuparp, Ee-peng Lim, and Steven\u00a0CH Hoi. 2021. Cross-modal food retrieval: learning a joint embedding of food images and recipes with semantic consistency and attention mechanism. IEEE Transactions on Multimedia 24 (2021), 2515\u20132525.","journal-title":"IEEE Transactions on Multimedia"},{"key":"e_1_3_2_1_35_1","volume-title":"Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 2221\u20132230","author":"Xie Zhongwei","year":"2021","unstructured":"Zhongwei Xie , Ling Liu , Lin Li , and Luo Zhong . 2021 . Learning joint embedding with modality alignments for cross-modal retrieval of recipes and food images . In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 2221\u20132230 . Zhongwei Xie, Ling Liu, Lin Li, and Luo Zhong. 2021. Learning joint embedding with modality alignments for cross-modal retrieval of recipes and food images. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 2221\u20132230."},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/TII.2020.3032352"},{"key":"e_1_3_2_1_37_1","volume-title":"2019 IEEE International Conference on Data Mining (ICDM). IEEE, 668\u2013677","author":"Xiong Zuobin","year":"2019","unstructured":"Zuobin Xiong , Wei Li , Qilong Han , and Zhipeng Cai . 2019 . Privacy-preserving auto-driving: a GAN-based approach to protect vehicular camera data . In 2019 IEEE International Conference on Data Mining (ICDM). IEEE, 668\u2013677 . Zuobin Xiong, Wei Li, Qilong Han, and Zhipeng Cai. 2019. Privacy-preserving auto-driving: a GAN-based approach to protect vehicular camera data. In 2019 IEEE International Conference on Data Mining (ICDM). IEEE, 668\u2013677."},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/TVT.2021.3061065"},{"key":"e_1_3_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2020.2967597"}],"event":{"name":"MMAsia '23: ACM Multimedia Asia","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Tainan Taiwan","acronym":"MMAsia '23"},"container-title":["ACM Multimedia Asia 2023"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3595916.3626389","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3595916.3626389","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T22:48:40Z","timestamp":1750286920000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3595916.3626389"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,12,6]]},"references-count":39,"alternative-id":["10.1145\/3595916.3626389","10.1145\/3595916"],"URL":"https:\/\/doi.org\/10.1145\/3595916.3626389","relation":{},"subject":[],"published":{"date-parts":[[2023,12,6]]},"assertion":[{"value":"2024-01-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}