{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T15:54:35Z","timestamp":1781538875502,"version":"3.54.5"},"publisher-location":"New York, NY, USA","reference-count":56,"publisher":"ACM","license":[{"start":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T00:00:00Z","timestamp":1781481600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/legalcode"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2026,6,16]]},"DOI":"10.1145\/3805622.3810595","type":"proceedings-article","created":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T14:42:57Z","timestamp":1781534577000},"page":"1241-1250","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["From Diffusion to Rectified Flow: Rethinking Text-Based Segmentation"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0003-9403-4462","authenticated-orcid":false,"given":"Zishen","family":"Qu","sequence":"first","affiliation":[{"name":"Zhejiang University, Hangzhou, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-8565-3119","authenticated-orcid":false,"given":"Xuesong","family":"Li","sequence":"additional","affiliation":[{"name":"bytedance, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7150-5506","authenticated-orcid":false,"given":"Hongwei","family":"Kang","sequence":"additional","affiliation":[{"name":"bytedance, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-5154-2004","authenticated-orcid":false,"given":"Haijian","family":"Gu","sequence":"additional","affiliation":[{"name":"Southeast University, Nanjing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-5498-3498","authenticated-orcid":false,"given":"Quan","family":"Meng","sequence":"additional","affiliation":[{"name":"bytedance, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7743-3822","authenticated-orcid":false,"given":"Tianrui","family":"Niu","sequence":"additional","affiliation":[{"name":"bytedance, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-7025-1820","authenticated-orcid":false,"given":"Xin","family":"Yang","sequence":"additional","affiliation":[{"name":"bytedance, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-5171-7078","authenticated-orcid":false,"given":"Ruidong","family":"Pan","sequence":"additional","affiliation":[{"name":"bytedance, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2026,6,15]]},"reference":[{"key":"e_1_3_3_1_2_2","unstructured":"Jason Baldridge Jakob Bauer Mukul Bhutani Nicole Brichtova Andrew Bunner Lluis Castrejon Kelvin Chan Yichang Chen Sander Dieleman Yuqing Du et\u00a0al. 2024. Imagen 3. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2408.07009 (2024)."},{"key":"e_1_3_3_1_3_2","unstructured":"Dmitry Baranchuk Ivan Rubachev Andrey Voynov Valentin Khrulkov and Artem Babenko. 2021. Label-efficient semantic segmentation with diffusion models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2112.03126 (2021)."},{"key":"e_1_3_3_1_4_2","unstructured":"Black Forest Labs. 2024. Flux: Official inference repository for flux.1 models. Accessed: 2025-02-07."},{"key":"e_1_3_3_1_5_2","unstructured":"Keqin Chen Zhao Zhang Weili Zeng Richong Zhang Feng Zhu and Rui Zhao. 2023. Shikra: Unleashing multimodal llm\u2019s referential dialogue magic. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2306.15195 (2023)."},{"key":"e_1_3_3_1_6_2","unstructured":"Barbara\u00a0Toniella Corradini Mustafa Shukor Paul Couairon Guillaume Couairon Franco Scarselli and Matthieu Cord. 2024. Freeseg-diff: Training-free open-vocabulary segmentation with diffusion models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2403.20105 (2024)."},{"key":"e_1_3_3_1_7_2","first-page":"8780","volume-title":"NeurIPS","author":"Dhariwal Prafulla","year":"2021","unstructured":"Prafulla Dhariwal and Alexander Nichol. 2021. Diffusion models beat gans on image synthesis. In NeurIPS , Vol.\u00a034. 8780\u20138794."},{"key":"e_1_3_3_1_8_2","first-page":"19822","volume-title":"NeurIPS","author":"Ding Ming","year":"2021","unstructured":"Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et\u00a0al. 2021. Cogview: Mastering text-to-image generation via transformers. In NeurIPS , Vol.\u00a034. 19822\u201319835."},{"key":"e_1_3_3_1_9_2","volume-title":"ICML","author":"Esser Patrick","year":"2024","unstructured":"Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M\u00fcller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et\u00a0al. 2024. Scaling rectified flow transformers for high-resolution image synthesis. In ICML."},{"key":"e_1_3_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01268"},{"key":"e_1_3_3_1_11_2","doi-asserted-by":"crossref","unstructured":"Li Fei-Fei Robert Fergus and Pietro Perona. 2006. One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence 28 4 (2006) 594\u2013611.","DOI":"10.1109\/TPAMI.2006.79"},{"key":"e_1_3_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-19784-0_6"},{"key":"e_1_3_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01043"},{"key":"e_1_3_3_1_14_2","unstructured":"Edward\u00a0J Hu Yelong Shen Phillip Wallis Zeyuan Allen-Zhu Yuanzhi Li Shean Wang Lu Wang Weizhu Chen et\u00a0al. 2022. Lora: Low-rank adaptation of large language models.ICLR 1 2 (2022) 3."},{"key":"e_1_3_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46448-0_7"},{"key":"e_1_3_3_1_16_2","unstructured":"Laurynas Karazija Iro Laina Andrea Vedaldi and Christian Rupprecht. 2023. Diffusion models for zero-shot open-vocabulary segmentation. arXiv e-prints (2023) arXiv\u20132306."},{"key":"e_1_3_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00453"},{"key":"e_1_3_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1086"},{"key":"e_1_3_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00371"},{"key":"e_1_3_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00915"},{"key":"e_1_3_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00602"},{"key":"e_1_3_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00705"},{"key":"e_1_3_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00695"},{"key":"e_1_3_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01252-6_39"},{"key":"e_1_3_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46493-0_48"},{"key":"e_1_3_3_1_26_2","unstructured":"Alex Nichol Prafulla Dhariwal Aditya Ramesh Pranav Shyam Pamela Mishkin Bob McGrew Ilya Sutskever and Mark Chen. 2021. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2112.10741 (2021)."},{"key":"e_1_3_3_1_27_2","first-page":"8162","volume-title":"ICML","author":"Nichol Alexander\u00a0Quinn","year":"2021","unstructured":"Alexander\u00a0Quinn Nichol and Prafulla Dhariwal. 2021. Improved denoising diffusion probabilistic models. In ICML. PMLR, 8162\u20138171."},{"key":"e_1_3_3_1_28_2","volume-title":"ICLR","author":"Pang Ziqi","year":"2025","unstructured":"Ziqi Pang, Xin Xu, and Yu-Xiong Wang. 2025. Aligning generative denoising with discriminative objectives unleashes diffusion for visual perception. In ICLR."},{"key":"e_1_3_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00387"},{"key":"e_1_3_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00384"},{"key":"e_1_3_3_1_31_2","unstructured":"Dustin Podell Zion English Kyle Lacey Andreas Blattmann Tim Dockhorn Jonas M\u00fcller Joe Penna and Robin Rombach. 2023. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2307.01952 (2023)."},{"key":"e_1_3_3_1_32_2","first-page":"8748","volume-title":"ICML","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong\u00a0Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et\u00a0al. 2021. Learning transferable visual models from natural language supervision. In ICML. 8748\u20138763."},{"key":"e_1_3_3_1_33_2","first-page":"8821","volume-title":"ICML","author":"Ramesh Aditya","year":"2021","unstructured":"Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In ICML. 8821\u20138831."},{"key":"e_1_3_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01236"},{"key":"e_1_3_3_1_35_2","volume-title":"NeurIPS","author":"Razavi Ali","year":"2019","unstructured":"Ali Razavi, Aaron Van\u00a0den Oord, and Oriol Vinyals. 2019. Generating diverse high-fidelity images with vq-vae-2. In NeurIPS , Vol.\u00a032."},{"key":"e_1_3_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.02491"},{"key":"e_1_3_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"e_1_3_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01231-1_3"},{"key":"e_1_3_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52734.2025.00020"},{"key":"e_1_3_3_1_40_2","unstructured":"Zhicong Tang Shuyang Gu Jianmin Bao Dong Chen and Fang Wen. 2022. Improved vector quantized diffusion models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2205.16007 (2022)."},{"key":"e_1_3_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01602"},{"key":"e_1_3_3_1_42_2","volume-title":"NeurIPS","author":"Den\u00a0Oord Aaron Van","year":"2017","unstructured":"Aaron Van Den\u00a0Oord, Oriol Vinyals, et\u00a0al. 2017. Neural discrete representation learning. In NeurIPS , Vol.\u00a030."},{"key":"e_1_3_3_1_43_2","first-page":"138981","volume-title":"NeurIPS","author":"Wang Chaoyang","year":"2024","unstructured":"Chaoyang Wang, Xiangtai Li, Lu Qi, Henghui Ding, Yunhai Tong, and Ming-Hsuan Yang. 2024. Semflow: Binding semantic segmentation and image synthesis via rectified flow. In NeurIPS , Vol.\u00a037. 138981\u2013139001."},{"key":"e_1_3_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01139"},{"key":"e_1_3_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01023"},{"key":"e_1_3_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00143"},{"key":"e_1_3_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.5244\/C.35.137"},{"key":"e_1_3_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01075"},{"key":"e_1_3_3_1_49_2","unstructured":"Haoxuan You Haotian Zhang Zhe Gan Xianzhi Du Bowen Zhang Zirui Wang Liangliang Cao Shih-Fu Chang and Yinfei Yang. 2023. Ferret: Refer and ground anything anywhere at any granularity. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2310.07704 (2023)."},{"key":"e_1_3_3_1_50_2","unstructured":"Fisher Yu Ari Seff Yinda Zhang Shuran Song Thomas Funkhouser and Jianxiong Xiao. 2015. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/1506.03365 (2015)."},{"key":"e_1_3_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00142"},{"key":"e_1_3_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00089"},{"key":"e_1_3_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00527"},{"key":"e_1_3_3_1_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00527"},{"key":"e_1_3_3_1_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01738"},{"key":"e_1_3_3_1_56_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00595"},{"key":"e_1_3_3_1_57_2","doi-asserted-by":"publisher","DOI":"10.52202\/075280-0868"}],"event":{"name":"ICMR '26: International Conference on Multimedia Retrieval","location":"Amsterdam The Netherlands","acronym":"ICMR '26","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 2026 International Conference on Multimedia Retrieval"],"original-title":[],"deposited":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T15:04:14Z","timestamp":1781535854000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3805622.3810595"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,6,15]]},"references-count":56,"alternative-id":["10.1145\/3805622.3810595","10.1145\/3805622"],"URL":"https:\/\/doi.org\/10.1145\/3805622.3810595","relation":{},"subject":[],"published":{"date-parts":[[2026,6,15]]},"assertion":[{"value":"2026-06-15","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}