{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:19:25Z","timestamp":1750220365126,"version":"3.41.0"},"reference-count":47,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2021,6,30]],"date-time":"2021-06-30T00:00:00Z","timestamp":1625011200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Graph."],"published-print":{"date-parts":[[2021,6,30]]},"abstract":"<jats:p>Nowadays, as cameras are rapidly adopted in our daily routine, images of documents are becoming both abundant and prevalent. Unlike natural images that capture physical objects, document-images contain a significant amount of text with critical semantics and complicated layouts. In this work, we devise a generic unsupervised technique to learn multimodal affinities between textual entities in a document-image, considering their visual style, the content of their underlying text, and their geometric context within the image. We then use these learned affinities to automatically cluster the textual entities in the image into different semantic groups. The core of our approach is a deep optimization scheme dedicated for an image provided by the user that detects and leverages reliable pairwise connections in the multimodal representation of the textual elements to properly learn the affinities. 
We show that our technique can operate on highly varying images spanning a wide range of documents and demonstrate its applicability for various editing operations manipulating the content, appearance, and geometry of the image.<\/jats:p>","DOI":"10.1145\/3451340","type":"journal-article","created":{"date-parts":[[2021,7,15]],"date-time":"2021-07-15T19:43:39Z","timestamp":1626378219000},"page":"1-16","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Learning Multimodal Affinities for Textual Editing in Images"],"prefix":"10.1145","volume":"40","author":[{"given":"Or","family":"Perel","sequence":"first","affiliation":[{"name":"Amazon Web Services, Israel"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Oron","family":"Anschel","sequence":"additional","affiliation":[{"name":"Amazon Web Services, Israel"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Omri","family":"Ben-Eliezer","sequence":"additional","affiliation":[{"name":"Harvard University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shai","family":"Mazor","sequence":"additional","affiliation":[{"name":"Amazon Web Services, Israel"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Hadar","family":"Averbuch-Elor","sequence":"additional","affiliation":[{"name":"Cornell-Tech, Cornell University, NY"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2021,7,15]]},"reference":[{"doi-asserted-by":"publisher","key":"e_1_2_1_1_1","DOI":"10.1109\/CVPR.2018.00522"},{"key":"e_1_2_1_2_1","volume-title":"Proceedings of the 27th International Conference on Computational Linguistics (COLING\u201918)","author":"Akbik Alan","year":"2018","unstructured":"Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. 
In Proceedings of the 27th International Conference on Computational Linguistics (COLING\u201918). 1638\u20131649."},{"doi-asserted-by":"publisher","key":"e_1_2_1_3_1","DOI":"10.1016\/0031-3203(90)90112-X"},{"doi-asserted-by":"publisher","key":"e_1_2_1_4_1","DOI":"10.1145\/1399504.1360639"},{"doi-asserted-by":"publisher","key":"e_1_2_1_5_1","DOI":"10.1109\/ICCV.2015.279"},{"doi-asserted-by":"publisher","key":"e_1_2_1_6_1","DOI":"10.5555\/573128"},{"doi-asserted-by":"publisher","key":"e_1_2_1_7_1","DOI":"10.1145\/1576246.1531330"},{"doi-asserted-by":"publisher","key":"e_1_2_1_8_1","DOI":"10.5555\/647798.736824"},{"key":"e_1_2_1_9_1","volume-title":"Proceedings of the Symposium on Document Image Understanding Technology.","author":"Breuel Thomas M.","year":"2003","unstructured":"Thomas M. Breuel. 2003. High performance document layout analysis. In Proceedings of the Symposium on Document Image Understanding Technology."},{"unstructured":"Zoya Bylinskii Sami Alsheikh Spandan Madan Adria Recasens Kimberli Zhong Hanspeter Pfister Fredo Durand and Aude Oliva. 2017. Understanding infographics through textual and visual tag prediction. 
arXiv preprint arXiv:1709.09215.","key":"e_1_2_1_10_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_11_1","DOI":"10.1145\/2366145.2366151"},{"key":"e_1_2_1_12_1","doi-asserted-by":"crossref","first-page":"917","DOI":"10.1109\/TVCG.2019.2934810","article-title":"Towards automated infographic design: Deep learning-based auto-extraction of extensible timeline","volume":"26","author":"Chen Zhutian","year":"2020","unstructured":"Zhutian Chen, Yun Wang, Qianwen Wang, Yong Wang, and Huamin Qu. 2020. Towards automated infographic design: Deep learning-based auto-extraction of extensible timeline. IEEE Trans. Vis. Comput. Graph. 26, 1 (2020), 917\u2013926.","journal-title":"IEEE Trans. Vis. Comput. Graph."},{"doi-asserted-by":"publisher","key":"e_1_2_1_13_1","DOI":"10.1109\/ICDAR.2015.7333941"},{"doi-asserted-by":"publisher","key":"e_1_2_1_14_1","DOI":"10.1016\/j.acalib.2006.08.002"},{"doi-asserted-by":"publisher","key":"e_1_2_1_15_1","DOI":"10.1109\/ICCV.2017.612"},{"doi-asserted-by":"publisher","key":"e_1_2_1_16_1","DOI":"10.1109\/TPAMI.2016.2599174"},{"doi-asserted-by":"publisher","key":"e_1_2_1_17_1","DOI":"10.1145\/1866158.1866171"},{"key":"e_1_2_1_18_1","volume-title":"Sergio Guadarrama, and Kevin P. Murphy.","author":"Fathi Alireza","year":"2017","unstructured":"Alireza Fathi, Zbigniew Wojna, Vivek Rathod, Peng Wang, Hyun Oh Song, Sergio Guadarrama, and Kevin P. Murphy. 2017. Semantic instance segmentation via deep metric learning. 
CoRR abs\/1703.10277 (2017)."},{"issue":"2019","key":"e_1_2_1_19_1","first-page":"16","article-title":"Clustering-driven deep embedding with pairwise constraints","volume":"39","author":"Fogel Sharon","year":"2018","unstructured":"Sharon Fogel, Hadar Averbuch-Elor, Jacov Goldberger, and Daniel Cohen-Or. 2018. Clustering-driven deep embedding with pairwise constraints. IEEE Comput. Graph. Applic. 39 (2019) 16-27.","journal-title":"IEEE Comput. Graph. Applic."},{"doi-asserted-by":"publisher","key":"e_1_2_1_20_1","DOI":"10.1145\/2601097.2601131"},{"doi-asserted-by":"publisher","key":"e_1_2_1_21_1","DOI":"10.1109\/CVPR.2016.265"},{"doi-asserted-by":"publisher","key":"e_1_2_1_22_1","DOI":"10.5555\/2969033.2969125"},{"doi-asserted-by":"publisher","key":"e_1_2_1_23_1","DOI":"10.1007\/s11263-018-1116-0"},{"doi-asserted-by":"publisher","key":"e_1_2_1_24_1","DOI":"10.1109\/CVPR.2016.90"},{"doi-asserted-by":"publisher","key":"e_1_2_1_25_1","DOI":"10.1007\/s11263-015-0823-z"},{"doi-asserted-by":"publisher","key":"e_1_2_1_26_1","DOI":"10.1109\/TPAMI.2016.2598339"},{"doi-asserted-by":"publisher","key":"e_1_2_1_27_1","DOI":"10.1007\/BF02703309"},{"doi-asserted-by":"publisher","key":"e_1_2_1_28_1","DOI":"10.1007\/978-3-319-46493-0_15"},{"key":"e_1_2_1_29_1","volume-title":"Kingma and Jimmy Ba","author":"Diederik","year":"2015","unstructured":"Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. CoRR abs\/1412.6980 (2015)."},{"doi-asserted-by":"publisher","key":"e_1_2_1_30_1","DOI":"10.1111\/j.1467-8659.2008.01264.x"},{"key":"e_1_2_1_31_1","volume-title":"Image inpainting for irregular holes using partial convolutions. 
arXiv preprint arXiv:1804.07723","author":"Liu Guilin","year":"2018","unstructured":"Guilin Liu, Fitsum A. Reda, Kevin J. Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro. 2018. Image inpainting for irregular holes using partial convolutions. arXiv preprint arXiv:1804.07723 (2018)."},{"doi-asserted-by":"publisher","key":"e_1_2_1_32_1","DOI":"10.1145\/3313831.3376263"},{"unstructured":"Spandan Madan Zoya Bylinskii Matthew Tancik Adri\u00e0 Recasens Kimberli Zhong Sami Alsheikh Hanspeter Pfister Aude Oliva and Fredo Durand. 2018. Synthetically trained icon proposals for parsing and summarizing infographics. arXiv preprint arXiv:1807.10441.","key":"e_1_2_1_33_1"},{"volume-title":"Proceedings of the British Machine Vision Conference (BMVC\u201918)","author":"Meyer Simone","unstructured":"Simone Meyer, Victor Cornill\u00e8re, Abdelaziz Djelouah, Christopher Schroers, and Markus H. Gross. 2018. Deep video color propagation. 
In Proceedings of the British Machine Vision Conference (BMVC\u201918).","key":"e_1_2_1_34_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_35_1","DOI":"10.1109\/34.244677"},{"doi-asserted-by":"publisher","key":"e_1_2_1_36_1","DOI":"10.1145\/882262.882269"},{"doi-asserted-by":"publisher","key":"e_1_2_1_37_1","DOI":"10.1111\/cgf.13193"},{"doi-asserted-by":"publisher","key":"e_1_2_1_38_1","DOI":"10.1145\/964696.964717"},{"unstructured":"Sohil Atul Shah and Vladlen Koltun. 2018. Deep continuous clustering. arXiv:1803.01449.","key":"e_1_2_1_39_1"},{"volume-title":"Automatic portrait segmentation for image stylization. Comput. Graph. Forum","author":"Shen Xiaoyong","unstructured":"Xiaoyong Shen, Aaron Hertzmann, Jiaya Jia, Sylvain Paris, Brian Price, Eli Shechtman, and Ian Sachs. 2016. Automatic portrait segmentation for image stylization. Comput. Graph. Forum, Vol. 35. Wiley Online Library, 93\u2013102.","key":"e_1_2_1_40_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_41_1","DOI":"10.5555\/1304596.1304846"},{"doi-asserted-by":"publisher","key":"e_1_2_1_42_1","DOI":"10.5555\/3045390.3045442"},{"doi-asserted-by":"publisher","key":"e_1_2_1_43_1","DOI":"10.1145\/2661229.2661247"},{"doi-asserted-by":"publisher","key":"e_1_2_1_44_1","DOI":"10.1145\/1661412.1618464"},{"doi-asserted-by":"publisher","key":"e_1_2_1_45_1","DOI":"10.1109\/CVPR.2016.556"},{"volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5315\u20135324","author":"Yang Xiao","unstructured":"Xiao Yang, Ersin Yumer, Paul Asente, Mike Kraley, Daniel Kifer, and C. Lee Giles. 2017. 
Learning to extract semantic structure from documents using multimodal fully convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5315\u20135324.","key":"e_1_2_1_46_1"},{"volume-title":"Proceedings of the IEEE International Conference on Computer Vision.","author":"Zhu Jun-Yan","unstructured":"Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. 
In Proceedings of the IEEE International Conference on Computer Vision.","key":"e_1_2_1_47_1"}],"container-title":["ACM Transactions on Graphics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3451340","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3451340","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:17:30Z","timestamp":1750191450000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3451340"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,6,30]]},"references-count":47,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2021,6,30]]}},"alternative-id":["10.1145\/3451340"],"URL":"https:\/\/doi.org\/10.1145\/3451340","relation":{},"ISSN":["0730-0301","1557-7368"],"issn-type":[{"type":"print","value":"0730-0301"},{"type":"electronic","value":"1557-7368"}],"subject":[],"published":{"date-parts":[[2021,6,30]]},"assertion":[{"value":"2019-12-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-02-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-07-15","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}