{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,16]],"date-time":"2026-06-16T15:06:35Z","timestamp":1781622395867,"version":"3.54.5"},"publisher-location":"New York, NY, USA","reference-count":35,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,10,18]],"date-time":"2021-10-18T00:00:00Z","timestamp":1634515200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Science Foundation of China","award":["61872113, 62006061, U1813215"],"award-info":[{"award-number":["61872113, 62006061, U1813215"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,10,18]]},"DOI":"10.1145\/3462244.3479965","type":"proceedings-article","created":{"date-parts":[[2021,10,15]],"date-time":"2021-10-15T15:01:58Z","timestamp":1634310118000},"page":"682-686","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["Semi-supervised Visual Feature Integration for Language Models through Sentence Visualization"],"prefix":"10.1145","author":[{"given":"Lisai","family":"Zhang","sequence":"first","affiliation":[{"name":"Intelligent Computing Research Center, Harbin Institute of Technology, Shenzhen, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Qingcai","family":"Chen","sequence":"additional","affiliation":[{"name":"Intelligent Computing Research Center, Harbin Institute of Technology, Shenzhen, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Joanna","family":"Siebert","sequence":"additional","affiliation":[{"name":"Intelligent Computing Research Center, Harbin Institute of Technology, Shenzhen, 
China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Buzhou","family":"Tang","sequence":"additional","affiliation":[{"name":"Intelligent Computing Research Center, Harbin Institute of Technology, Shenzhen, China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2021,10,18]]},"reference":[{"key":"e_1_3_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_3_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.artint.2005.04.009"},{"key":"e_1_3_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D15-1075"},{"key":"e_1_3_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.5555\/2655713.2655714"},{"key":"e_1_3_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58577-8_7"},{"key":"e_1_3_2_2_6_1","doi-asserted-by":"crossref","unstructured":"Grzegorz Chrupa\u0142a \u00c0kos K\u00e1d\u00e1r and Afra Alishahi. 2015. Learning language through pictures. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) Vol.\u00a02. 112\u2013118.","DOI":"10.3115\/v1\/P15-2019"},{"key":"e_1_3_2_2_7_1","unstructured":"Junyoung Chung Caglar Gulcehre KyungHyun Cho and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. 
arXiv preprint arXiv:1412.3555(2014)."},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v31i1.11155"},{"key":"e_1_3_2_2_9_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805(2018).","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805(2018)."},{"key":"e_1_3_2_2_10_1","unstructured":"Fartash Faghri David\u00a0J Fleet Jamie\u00a0Ryan Kiros and Sanja Fidler. 2017. VSE++: improved visual-semantic embeddings. arXiv preprint arXiv:1707.05612(2017)."},{"key":"e_1_3_2_2_11_1","unstructured":"Andrea Frome Greg\u00a0S Corrado Jon Shlens Samy Bengio Jeff Dean Marc'\u00a0Aurelio Ranzato and Tomas Mikolov. 2013. DeViSE: A Deep Visual-Semantic Embedding Model. In Advances in Neural Information Processing Systems 26 C.\u00a0J.\u00a0C. Burges L.\u00a0Bottou M.\u00a0Welling Z.\u00a0Ghahramani and K.\u00a0Q. Weinberger(Eds.). 
2121\u20132129."},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"e_1_3_2_2_14_1","unstructured":"Andrej Karpathy Armand Joulin and Li\u00a0F Fei-Fei. 2014. Deep Fragment Embeddings for Bidirectional Image Sentence Mapping. In Advances in Neural Information Processing Systems 27. 1889\u20131897."},{"key":"e_1_3_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1005"},{"key":"e_1_3_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-1038"},{"key":"e_1_3_2_2_17_1","volume-title":"Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980(2014).","author":"Kingma P","year":"2014","unstructured":"Diederik\u00a0P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980(2014)."},{"key":"e_1_3_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-1085"},{"key":"e_1_3_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6795"},{"key":"e_1_3_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_3_2_2_21_1","volume-title":"In Proceedings of LREC.","author":"Luisa Bentivogli Raffaella\u00a0Bernardi Marco Baroni","year":"2014","unstructured":"
Marco Baroni Luisa Bentivogli Raffaella\u00a0Bernardi Marco\u00a0Marelli, Stefano\u00a0Menini and Roberto Zamparelli. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of LREC."},{"key":"e_1_3_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.208"},{"key":"e_1_3_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/S18-1119"},{"key":"e_1_3_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.303"},{"key":"e_1_3_2_2_25_1","volume-title":"Imagebert: Cross-modal pre-training with large-scale weak-supervised image-text data. arXiv preprint arXiv:2001.07966(2020).","author":"Qi Di","year":"2020","unstructured":"Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, and Arun Sacheti. 2020. Imagebert: Cross-modal pre-training with large-scale weak-supervised image-text data. arXiv preprint arXiv:2001.07966(2020)."},{"key":"e_1_3_2_2_26_1","unstructured":"Shaoqing Ren Kaiming He Ross Girshick and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems. 91\u201399."},{"key":"e_1_3_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/P14-1068"},{"key":"e_1_3_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00177"},{"key":"e_1_3_2_2_29_1","unstructured":"Robert Speer and Catherine Havasi. 2012. 
Representing General Relational Knowledge in ConceptNet 5. In LREC. 3679\u20133686."},{"key":"e_1_3_2_2_30_1","volume-title":"VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In International Conference on Learning Representations.","author":"Su Weijie","year":"2019","unstructured":"Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2019. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In International Conference on Learning Representations."},{"key":"e_1_3_2_2_31_1","volume-title":"Integration of visual and linguistic information in spoken language comprehension. Science 268, 5217","author":"Tanenhaus K","year":"1995","unstructured":"Michael\u00a0K Tanenhaus, Michael\u00a0J Spivey-Knowlton, Kathleen\u00a0M Eberhard, and Julie\u00a0C Sedivy. 1995. Integration of visual and linguistic information in spoken language comprehension. Science 268, 5217 (1995), 1632\u20131634."},{"key":"e_1_3_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/S18-1120"},{"key":"e_1_3_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.12031"},{"key":"e_1_3_2_2_34_1","volume-title":"International conference on machine learning. 2048\u20132057","author":"Xu Kelvin","year":"2015","unstructured":"Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. 
In International conference on machine learning. 2048\u20132057."},{"key":"e_1_3_2_2_35_1","volume-title":"Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, Sheila\u00a0A. McIlraith and Kilian\u00a0Q. Weinberger (Eds.). 5626\u20135633","author":"Zablocki Eloi","year":"2018","unstructured":"Eloi Zablocki, Benjamin Piwowarski, Laure Soulier, and Patrick Gallinari. 2018. Learning Multi-Modal Word Representation Grounded in Visual Context. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, Sheila\u00a0A. McIlraith and Kilian\u00a0Q. Weinberger (Eds.). 
5626\u20135633."}],"event":{"name":"ICMI '21: INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION","location":"Montr\u00e9al QC Canada","acronym":"ICMI '21","sponsor":["SIGCHI ACM Special Interest Group on Computer-Human Interaction"]},"container-title":["Proceedings of the 2021 International Conference on Multimodal Interaction"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3462244.3479965","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3462244.3479965","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:19:01Z","timestamp":1750191541000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3462244.3479965"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,10,18]]},"references-count":35,"alternative-id":["10.1145\/3462244.3479965","10.1145\/3462244"],"URL":"https:\/\/doi.org\/10.1145\/3462244.3479965","relation":{},"subject":[],"published":{"date-parts":[[2021,10,18]]},"assertion":[{"value":"2021-10-18","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}