{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:13:02Z","timestamp":1750219982379,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":26,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T00:00:00Z","timestamp":1665360000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"JSPS KAKENHI","award":["21H05812, 22H00540, 22H00548, and 22K19808"],"award-info":[{"award-number":["21H05812, 22H00540, 22H00548, and 22K19808"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2022,10,10]]},"DOI":"10.1145\/3552484.3555751","type":"proceedings-article","created":{"date-parts":[[2022,10,24]],"date-time":"2022-10-24T22:43:36Z","timestamp":1666651416000},"page":"29-37","source":"Crossref","is-referenced-by-count":0,"title":["Text-based Image Editing for Food Images with CLIP"],"prefix":"10.1145","author":[{"given":"Kohei","family":"Yamamoto","sequence":"first","affiliation":[{"name":"The University of Electro-Communications, Chofu-shi, Tokyo, Japan"}]},{"given":"Keiji","family":"Yanai","sequence":"additional","affiliation":[{"name":"The University of Electro-Communications, Chofu-shi, Tokyo, Japan"}]}],"member":"320","published-online":{"date-parts":[[2022,10,24]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"Proc.of Advances in Neural Information Processing Systems","author":"Goodfellow Ian","year":"2014","unstructured":"Ian Goodfellow , Jean Pouget-Abadie , Mehdi Mirza , Bing Xu , David Warde-Farley , Sherjil Ozair , Aaron Courville , and Yoshua Bengio . Generative adversarial nets . In Proc.of Advances in Neural Information Processing Systems , 2014 . 
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Proc.of Advances in Neural Information Processing Systems, 2014."},{"key":"e_1_3_2_1_2_1","first-page":"7880","volume-title":"Proc.of IEEE Computer Vision and Pattern Recognition","author":"Li Bowen","year":"2020","unstructured":"Bowen Li , Xiaojuan Qi , Thomas Lukasiewicz , and Philip HS Torr . Manigan : Text-guided image manipulation . In Proc.of IEEE Computer Vision and Pattern Recognition , pages 7880 -- 7889 , 2020 . Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, and Philip HS Torr. Manigan: Text-guided image manipulation. In Proc.of IEEE Computer Vision and Pattern Recognition, pages 7880--7889, 2020."},{"key":"e_1_3_2_1_3_1","volume-title":"Styleclip: Text-driven manipulation of stylegan imagery. In arXiv preprint arXiv:2103.17249","author":"Patashnik Or","year":"2021","unstructured":"Or Patashnik , Zongze Wu , Eli Shechtman , Daniel Cohen-Or , and Dani Lischinski . Styleclip: Text-driven manipulation of stylegan imagery. In arXiv preprint arXiv:2103.17249 , 2021 . Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. In arXiv preprint arXiv:2103.17249, 2021."},{"key":"e_1_3_2_1_4_1","volume-title":"Stylegan-nada: Clip-guided domain adaptation of image generators. In arXiv preprint arXiv:2108.00946","author":"Gal Rinon","year":"2021","unstructured":"Rinon Gal , Or Patashnik , Haggai Maron , Gal Chechik , and Daniel Cohen-Or . Stylegan-nada: Clip-guided domain adaptation of image generators. In arXiv preprint arXiv:2108.00946 , 2021 . Rinon Gal, Or Patashnik, Haggai Maron, Gal Chechik, and Daniel Cohen-Or. Stylegan-nada: Clip-guided domain adaptation of image generators. 
In arXiv preprint arXiv:2108.00946, 2021."},{"key":"e_1_3_2_1_5_1","volume-title":"Vqgan-clip: Open domain image generation and editing with natural language guidance. arXiv preprint arXiv:2204.08583","author":"Crowson Katherine","year":"2022","unstructured":"Katherine Crowson , Stella Biderman , Daniel Kornis , Dashiell Stander , Eric Hallahan , Louis Castricato , and Edward Raff . Vqgan-clip: Open domain image generation and editing with natural language guidance. arXiv preprint arXiv:2204.08583 , 2022 . Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, and Edward Raff. Vqgan-clip: Open domain image generation and editing with natural language guidance. arXiv preprint arXiv:2204.08583, 2022."},{"key":"e_1_3_2_1_6_1","first-page":"2256","volume-title":"Proc.of IEEE Computer Vision and Pattern Recognition","author":"Xia Weihao","year":"2021","unstructured":"Weihao Xia , Yujiu Yang , Jing-Hao Xue , and Baoyuan Wu. Tedigan : Text-guided diverse face image generation and manipulation . In Proc.of IEEE Computer Vision and Pattern Recognition , pages 2256 -- 2265 , 2021 . Weihao Xia, Yujiu Yang, Jing-Hao Xue, and Baoyuan Wu. Tedigan: Text-guided diverse face image generation and manipulation. In Proc.of IEEE Computer Vision and Pattern Recognition, pages 2256--2265, 2021."},{"key":"e_1_3_2_1_7_1","first-page":"4401","volume-title":"Proc.of IEEE Computer Vision and Pattern Recognition","author":"Karras Tero","year":"2019","unstructured":"Tero Karras , Samuli Laine , and Timo Aila . A style-based generator architecture for generative adversarial networks . In Proc.of IEEE Computer Vision and Pattern Recognition , pages 4401 -- 4410 , 2019 . Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. 
In Proc.of IEEE Computer Vision and Pattern Recognition, pages 4401--4410, 2019."},{"key":"e_1_3_2_1_8_1","unstructured":"Alec Radford Jong Wook Kim Chris Hallacy Aditya Ramesh Gabriel Goh Sandhini Agarwal Girish Sastry Amanda Askell Pamela Mishkin Jack Clark et al. Learning transferable visual models from natural language supervision. In arXiv preprint arXiv:2103.00020 2021.  Alec Radford Jong Wook Kim Chris Hallacy Aditya Ramesh Gabriel Goh Sandhini Agarwal Girish Sastry Amanda Askell Pamela Mishkin Jack Clark et al. Learning transferable visual models from natural language supervision. In arXiv preprint arXiv:2103.00020 2021."},{"key":"e_1_3_2_1_9_1","volume-title":"Paint by word. In arXiv preprint arXiv:2103.10951","author":"Bau David","year":"2021","unstructured":"David Bau , Alex Andonian , Audrey Cui , YeonHwan Park , Ali Jahanian , Aude Oliva , and Antonio Torralba . Paint by word. In arXiv preprint arXiv:2103.10951 , 2021 . David Bau, Alex Andonian, Audrey Cui, YeonHwan Park, Ali Jahanian, Aude Oliva, and Antonio Torralba. Paint by word. In arXiv preprint arXiv:2103.10951, 2021."},{"key":"e_1_3_2_1_10_1","volume-title":"Large scale gan training for high fidelity natural image synthesis. In arXiv preprint arXiv:1809.11096","author":"Brock Andrew","year":"2018","unstructured":"Andrew Brock , Jeff Donahue , and Karen Simonyan . Large scale gan training for high fidelity natural image synthesis. In arXiv preprint arXiv:1809.11096 , 2018 . Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. In arXiv preprint arXiv:1809.11096, 2018."},{"key":"e_1_3_2_1_11_1","first-page":"12873","volume-title":"Proc.of IEEE Computer Vision and Pattern Recognition","author":"Esser Patrick","year":"2021","unstructured":"Patrick Esser , Robin Rombach , and Bjorn Ommer . Taming transformers for high-resolution image synthesis . In Proc.of IEEE Computer Vision and Pattern Recognition , pages 12873 -- 12883 , 2021 . 
Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proc.of IEEE Computer Vision and Pattern Recognition, pages 12873--12883, 2021."},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-46805-6_19"},{"key":"e_1_3_2_1_13_1","first-page":"5998","volume-title":"Proc.of Advances in Neural Information Processing Systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani , Noam Shazeer , Niki Parmar , Jakob Uszkoreit , Llion Jones , Aidan N Gomez , Lukasz Kaiser , and Illia Polosukhin . Attention is all you need . In Proc.of Advances in Neural Information Processing Systems , pages 5998 -- 6008 , 2017 . Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proc.of Advances in Neural Information Processing Systems, pages 5998--6008, 2017."},{"key":"e_1_3_2_1_14_1","first-page":"3020","volume-title":"Proc.of IEEE Computer Vision and Pattern Recognition","author":"Salvador Amaia","year":"2017","unstructured":"Amaia Salvador , Nicholas Hynes , Yusuf Aytar , Javier Marin , Ferda Ofli , Ingmar Weber , and Antonio Torralba . Learning cross-modal embeddings for cooking recipes and food images . In Proc.of IEEE Computer Vision and Pattern Recognition , pages 3020 -- 3028 , 2017 . Amaia Salvador, Nicholas Hynes, Yusuf Aytar, Javier Marin, Ferda Ofli, Ingmar Weber, and Antonio Torralba. Learning cross-modal embeddings for cooking recipes and food images. In Proc.of IEEE Computer Vision and Pattern Recognition, pages 3020--3028, 2017."},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"crossref","first-page":"1244","DOI":"10.1145\/3240508.3241391","volume-title":"Proc.of ACM International Conference Multimedia","author":"Tanno Ryosuke","year":"2018","unstructured":"Ryosuke Tanno , Daichi Horita , Wataru Shimoda , and Keiji Yanai . Magical rice bowl: A real-time food category changer . 
In Proc.of ACM International Conference Multimedia , pages 1244 -- 1246 , 2018 . Ryosuke Tanno, Daichi Horita, Wataru Shimoda, and Keiji Yanai. Magical rice bowl: A real-time food category changer. In Proc.of ACM International Conference Multimedia, pages 1244--1246, 2018."},{"key":"e_1_3_2_1_16_1","volume-title":"Foodx-251: A dataset for fine-grained food classification. In arXiv preprint arXiv:1907.06167","author":"Kaur Parneet","year":"2019","unstructured":"Parneet Kaur , Karan Sikka , Weijun Wang , Serge Belongie , and Ajay Divakaran . Foodx-251: A dataset for fine-grained food classification. In arXiv preprint arXiv:1907.06167 , 2019 . Parneet Kaur, Karan Sikka, Weijun Wang, Serge Belongie, and Ajay Divakaran. Foodx-251: A dataset for fine-grained food classification. In arXiv preprint arXiv:1907.06167, 2019."},{"key":"e_1_3_2_1_17_1","first-page":"393","volume-title":"Proc.of ACM International Conference Multimedia","author":"Min Weiqing","year":"2020","unstructured":"Weiqing Min , Linhu Liu , Zhiling Wang , Zhengdong Luo , Xiaoming Wei , Xiaolin Wei , and Shuqiang Jiang . Isia food-500 : A dataset for large-scale food recognition via stacked global-local attention network . In Proc.of ACM International Conference Multimedia , pages 393 -- 401 , 2020 . Weiqing Min, Linhu Liu, Zhiling Wang, Zhengdong Luo, Xiaoming Wei, Xiaolin Wei, and Shuqiang Jiang. Isia food-500: A dataset for large-scale food recognition via stacked global-local attention network. In Proc.of ACM International Conference Multimedia, pages 393--401, 2020."},{"key":"e_1_3_2_1_18_1","volume-title":"Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. In arXiv preprint arXiv:2109.01134","author":"Zhou Kaiyang","year":"2021","unstructured":"Kaiyang Zhou , Jingkang Yang , Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. In arXiv preprint arXiv:2109.01134 , 2021 . Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. 
Learning to prompt for vision-language models. In arXiv preprint arXiv:2109.01134, 2021."},{"key":"e_1_3_2_1_19_1","first-page":"29","article-title":"Improved techniques for training gans","author":"Salimans Tim","year":"2016","unstructured":"Tim Salimans , Ian Goodfellow , Wojciech Zaremba , Vicki Cheung , Alec Radford , and Xi Chen . Improved techniques for training gans . Proc.of Advances in Neural Information Processing Systems , 29 , 2016 . Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Proc.of Advances in Neural Information Processing Systems, 29, 2016.","journal-title":"Proc.of Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_20_1","first-page":"30","article-title":"Gans trained by a two time-scale update rule converge to a local nash equilibrium","author":"Heusel Martin","year":"2017","unstructured":"Martin Heusel , Hubert Ramsauer , Thomas Unterthiner , Bernhard Nessler , and Sepp Hochreiter . Gans trained by a two time-scale update rule converge to a local nash equilibrium . Proc.of Advances in Neural Information Processing Systems , 30 , 2017 . Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Proc.of Advances in Neural Information Processing Systems, 30, 2017.","journal-title":"Proc.of Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_21_1","volume-title":"Demystifying mmd gans. arXiv preprint arXiv:1801.01401","author":"Bickowski Mikolaj","year":"2018","unstructured":"Mikolaj Bickowski , Danica J Sutherland , Michael Arbel , and Arthur Gretton . Demystifying mmd gans. arXiv preprint arXiv:1801.01401 , 2018 . Mikolaj Bickowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans. 
arXiv preprint arXiv:1801.01401, 2018."},{"key":"e_1_3_2_1_22_1","volume-title":"Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718","author":"Hessel Jack","year":"2021","unstructured":"Jack Hessel , Ari Holtzman , Maxwell Forbes , Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718 , 2021 . Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718, 2021."},{"key":"e_1_3_2_1_23_1","first-page":"770","volume-title":"Proc.of IEEE Computer Vision and Pattern Recognition","author":"He Kaiming","year":"2016","unstructured":"Kaiming He , Xiangyu Zhang , Shaoqing Ren , and Jian Sun . Deep residual learning for image recognition . In Proc.of IEEE Computer Vision and Pattern Recognition , pages 770 -- 778 , 2016 . Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc.of IEEE Computer Vision and Pattern Recognition, pages 770--778, 2016."},{"key":"e_1_3_2_1_24_1","volume-title":"Decoupled weight decay regularization. In arXiv preprint arXiv:1711.05101","author":"Loshchilov Ilya","year":"2017","unstructured":"Ilya Loshchilov and Frank Hutter . Decoupled weight decay regularization. In arXiv preprint arXiv:1711.05101 , 2017 . Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In arXiv preprint arXiv:1711.05101, 2017."},{"key":"e_1_3_2_1_25_1","first-page":"647","volume-title":"Proc.of International Conference on Pattern Recognition","author":"Okamoto Kaimu","year":"2021","unstructured":"Kaimu Okamoto and Keiji Yanai . Uec-foodpix complete : A large-scale food image segmentation dataset . In Proc.of International Conference on Pattern Recognition , pages 647 -- 659 . Springer , 2021 . Kaimu Okamoto and Keiji Yanai. 
Uec-foodpix complete: A large-scale food image segmentation dataset. In Proc.of International Conference on Pattern Recognition, pages 647--659. Springer, 2021."},{"key":"e_1_3_2_1_26_1","first-page":"6840","article-title":"Denoising diffusion probabilistic models","volume":"33","author":"Ho Jonathan","year":"2020","unstructured":"Jonathan Ho , Ajay Jain , and Pieter Abbeel . Denoising diffusion probabilistic models . Proc.of Advances in Neural Information Processing Systems , 33 : 6840 -- 6851 , 2020 Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Proc.of Advances in Neural Information Processing Systems, 33:6840--6851, 2020","journal-title":"Proc.of Advances in Neural Information Processing Systems"}],"event":{"name":"MM '22: The 30th ACM International Conference on Multimedia","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Lisboa Portugal","acronym":"MM '22"},"container-title":["Proceedings of the 7th International Workshop on Multimedia Assisted Dietary Management"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3552484.3555751","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3552484.3555751","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:49:26Z","timestamp":1750182566000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3552484.3555751"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,10]]},"references-count":26,"alternative-id":["10.1145\/3552484.3555751","10.1145\/3552484"],"URL":"https:\/\/doi.org\/10.1145\/3552484.3555751","relation":{},"subject":[],"published":{"date-parts":[[2022,10,10]]}}}