{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,31]],"date-time":"2025-10-31T08:02:48Z","timestamp":1761897768141,"version":"3.37.3"},"reference-count":58,"publisher":"Springer Science and Business Media LLC","issue":"3","license":[{"start":{"date-parts":[[2023,2,27]],"date-time":"2023-02-27T00:00:00Z","timestamp":1677456000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,2,27]],"date-time":"2023-02-27T00:00:00Z","timestamp":1677456000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Mach. Intell. Res."],"published-print":{"date-parts":[[2023,6]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply utilize the vision transformer architecture for replacing the bidirectional encoder representations from Transformers (BERT) in the pre-training model, making MVLT the first end-to-end framework for the fashion domain. Besides, we designed masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT can easily generalize to various matching and generative tasks. Experimental results show obvious improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner, Kaleido-BERT. 
The code is available at <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/GewelsJI\/MVLT\">https:\/\/github.com\/GewelsJI\/MVLT<\/jats:ext-link>.<\/jats:p>","DOI":"10.1007\/s11633-022-1394-4","type":"journal-article","created":{"date-parts":[[2023,2,27]],"date-time":"2023-02-27T03:02:36Z","timestamp":1677466956000},"page":"421-434","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":16,"title":["Masked Vision-language Transformer in Fashion"],"prefix":"10.1007","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7092-2877","authenticated-orcid":false,"given":"Ge-Peng","family":"Ji","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2561-7712","authenticated-orcid":false,"given":"Mingchen","family":"Zhuge","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6636-5702","authenticated-orcid":false,"given":"Dehong","family":"Gao","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5245-7518","authenticated-orcid":false,"given":"Deng-Ping","family":"Fan","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1127-8887","authenticated-orcid":false,"given":"Christos","family":"Sakaridis","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3445-5711","authenticated-orcid":false,"given":"Luc Van","family":"Gool","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2023,2,27]]},"reference":[{"key":"1394_CR1","unstructured":"A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. H. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16 \u00d7 16 words: Transformers for image recognition at scale. 
In Proceedings of the 9th International Conference on Learning Representations, 2021."},{"key":"1394_CR2","doi-asserted-by":"publisher","first-page":"9992","DOI":"10.1109\/ICCV48922.2021.00986","volume-title":"Proceedings of IEEE\/CVF International Conference on Computer Vision","author":"Z Liu","year":"2021","unstructured":"Z. Liu, Y. T. Lin, Y. Cao, H. Hu, Y. X. Wei, Z. Zhang, S. Lin, B. N. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of IEEE\/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 9992\u201310002, 2021. DOI: https:\/\/doi.org\/10.1109\/ICCV48922.2021.00986."},{"key":"1394_CR3","unstructured":"A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, US, pp. 6000\u20136010, 2017."},{"issue":"3","key":"1394_CR4","doi-asserted-by":"publisher","first-page":"169","DOI":"10.1007\/s11633-022-1331-6","volume":"19","author":"T X Sun","year":"2022","unstructured":"T. X. Sun, X. Y. Liu, X. P. Qiu, X. J. Huang. Paradigm shift in natural language processing. Machine Intelligence Research, vol. 19, no. 3, pp. 169\u2013183, 2022. DOI: https:\/\/doi.org\/10.1007\/s11633-022-1331-6.","journal-title":"Machine Intelligence Research"},{"key":"1394_CR5","unstructured":"S. Agarwal, G. Krueger, J. Clark, A. Radford, J. W. Kim, M. Brundage. Evaluating CLIP: Towards characterization of broader capabilities and downstream implications, [Online], Available: https:\/\/arxiv.org\/abs\/2108.02818, August 05, 2021."},{"key":"1394_CR6","unstructured":"M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, I. Sutskever. Generative pretraining from pixels. In Proceedings of the 37th International Conference on Machine Learning, Article number 233, 2020."},{"key":"1394_CR7","doi-asserted-by":"publisher","unstructured":"J. Y. Lin, R. Men, A. 
Yang, C. Zhou, Y. C. Zhang, P. Wang, J. R. Zhou, J. Tang, H. X. Yang. M6: Multi-modality-to-multi-modality multitask mega-transformer for unified pretraining. In Proceedings of the 27th ACM SIGK-DD Conference on Knowledge Discovery & Data Mining, Association for Computing Machinery, 2021. DOI: https:\/\/doi.org\/10.1145\/3447548.3467206.","DOI":"10.1145\/3447548.3467206"},{"key":"1394_CR8","unstructured":"A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, I. Sutskever. Zero-shot text-to-image generation. In Proceedings of the 38th International Conference on Machine Learning, pp. 8821\u20138831, 2021."},{"key":"1394_CR9","doi-asserted-by":"publisher","first-page":"11302","DOI":"10.1109\/CVPR46437.2021.01115","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"H Wu","year":"2021","unstructured":"H. Wu, Y. P. Gao, X. X. Guo, Z. Al-Halah, S. Rennie, K. Grauman, R. Feris. Fashion IQ: A new dataset towards retrieving images by natural language feedback. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 11302\u201311312, 2021. DOI: https:\/\/doi.org\/10.1109\/CVPR46437.2021.01115."},{"key":"1394_CR10","doi-asserted-by":"publisher","unstructured":"J. Devlin, M. W. Chang, K. Lee, K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, USA, pp.4171\u20134186, 2019. DOI: https:\/\/doi.org\/10.18653\/v1\/N19-1423.","DOI":"10.18653\/v1\/N19-1423"},{"key":"1394_CR11","doi-asserted-by":"publisher","first-page":"770","DOI":"10.1109\/CVPR.2016.90","volume-title":"Proceedings of IEEE Conference on Computer Vision and Pattern Recognition","author":"K M He","year":"2016","unstructured":"K. M. He, X. Y. Zhang, S. Q. Ren, J. Sun. 
Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, USA, pp. 770\u2013778, 2016. DOI: https:\/\/doi.org\/10.1109\/CVPR.2016.90."},{"key":"1394_CR12","unstructured":"S. Q. Ren, K. M. He, R. Girshick, J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of 2015 Annual Conference on Neural Information Processing Systems, Montreal, Canada, pp. 91\u201399, 2015."},{"key":"1394_CR13","unstructured":"D. Qi, L. Su, J. Song, E. Cui, T. Bharti, A. Sacheti. Image-BERT: Cross-modal pre-training with large-scale weak-supervised image-text data, [Online], Available: https:\/\/arxiv.org\/abs\/2001.07966, January 23, 2020."},{"key":"1394_CR14","unstructured":"J. S. Lu, D. Batra, D. Parikh, S. Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of the 33rd Conference on Neural Information Processing Systems, Vancouver, Canada, pp. 13\u201323, 2019."},{"key":"1394_CR15","doi-asserted-by":"publisher","first-page":"104","DOI":"10.1007\/978-3-030-58577-8_7","volume-title":"Proceedings of the 16th European Conference on Computer Vision","author":"Y C Chen","year":"2020","unstructured":"Y. C. Chen, L. J. Li, L. C. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, J. J. Liu. UNITER: UNiversal image-TExt representation learning. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 104\u2013120, 2020. DOI: https:\/\/doi.org\/10.1007\/978-3-030-58577-8_7."},{"key":"1394_CR16","doi-asserted-by":"publisher","first-page":"5046","DOI":"10.1109\/ICCV.2019.00515","volume-title":"Proceedings of IEEE\/CVF International Conference On Computer Vision","author":"W L Hsiao","year":"2019","unstructured":"W. L. Hsiao, I. Katsman, C. Y. Wu, D. Parikh, K. Grauman. Fashion++: Minimal edits for outfit improvement. 
In Proceedings of IEEE\/CVF International Conference On Computer Vision, IEEE, Montreal, Canada, pp. 5046\u20135055, 2019. DOI: https:\/\/doi.org\/10.1109\/ICCV.2019.00515."},{"key":"1394_CR17","doi-asserted-by":"publisher","first-page":"405","DOI":"10.1007\/978-3-030-01270-0_24","volume-title":"Proceedings of the 15th European Conference on Computer Vision","author":"M I Vasileva","year":"2018","unstructured":"M. I. Vasileva, B. A. Plummer, K. Dusad, S. Rajpal, R. Kumar, D. Forsyth. Learning type-aware embeddings for fashion compatibility. In Proceedings of the 15th European Conference on Computer Vision, Springer, Munich, Germany, pp. 405\u2013421, 2018. DOI: https:\/\/doi.org\/10.1007\/978-3-030-01270-0_24."},{"key":"1394_CR18","unstructured":"D. P. Fan, M. C. Zhuge, L. Shao. Domain Specific Pre-Training of Cross Modality Transformer Model, US20220277218, September 2022."},{"key":"1394_CR19","doi-asserted-by":"publisher","unstructured":"D. H. Gao, L. B. Jin, B. Chen, M. H. Qiu, P. Li, Y. Wei, Y. Hu, H. Wang. FashionBERT: Text and image matching with adaptive loss for cross-modal retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, pp. 2251\u20132260, 2020. DOI: https:\/\/doi.org\/10.1145\/3397271.3401430.","DOI":"10.1145\/3397271.3401430"},{"key":"1394_CR20","doi-asserted-by":"publisher","first-page":"12642","DOI":"10.1109\/CVPR46437.2021.01246","volume-title":"Proceedings of IEEE\/CVF Conference on computer vision and pattern recognition","author":"M C Zhuge","year":"2021","unstructured":"M. C. Zhuge, D. H. Gao, D. P. Fan, L. B. Jin, B. Chen, H. M. Zhou, M. H. Qiu, L. Shao. Kaleido-BERT: Vision-language pre-training on fashion domain. In Proceedings of IEEE\/CVF Conference on computer vision and pattern recognition, IEEE, Nashville, USA, pp. 12642\u201312652, 2021. 
DOI: https:\/\/doi.org\/10.1109\/CVPR46437.2021.01246."},{"key":"1394_CR21","doi-asserted-by":"publisher","first-page":"548","DOI":"10.1109\/ICCV48922.2021.00061","volume-title":"Proceedings of IEEE\/CVF International Conference on Computer Vision","author":"W H Wang","year":"2021","unstructured":"W. H. Wang, E. Z. Xie, X. Li, D. P. Fan, K. T. Song, D. Liang, T. Lu, P. Luo, L. Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of IEEE\/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 548\u2013558, 2021. DOI: https:\/\/doi.org\/10.1109\/ICCV48922.2021.00061."},{"key":"1394_CR22","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/978-3-030-58601-0_1","volume-title":"Proceedings of the 16th European Conference on Computer Vision","author":"X W Yang","year":"2020","unstructured":"X. W. Yang, H. M. Zhang, D. Jin, Y. R. Liu, C. H. Wu, J. C. Tan, D. L. Xie, J. Wang, X. Wang. Fashion captioning: Towards generating accurate descriptions with semantic rewards. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 1\u201317, 2020. DOI: https:\/\/doi.org\/10.1007\/978-3-030-58601-0_1."},{"key":"1394_CR23","doi-asserted-by":"publisher","first-page":"10133","DOI":"10.1109\/CVPR42600.2020.01015","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Z Al-Halah","year":"2020","unstructured":"Z. Al-Halah, K. Grauman. From Paris to Berlin: Discovering fashion style influences around the world. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 10133\u201310142, 2020. DOI: https:\/\/doi.org\/10.1109\/CVPR42600.2020.01015."},{"key":"1394_CR24","doi-asserted-by":"publisher","unstructured":"H. Tan, M. Bansal. LXMERT: Learning cross-modality encoder representations from transformers. 
In Proceedings of Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, pp. 5100\u20135111, 2019. DOI: https:\/\/doi.org\/10.18653\/v1\/D19-1514.","DOI":"10.18653\/v1\/D19-1514"},{"key":"1394_CR25","unstructured":"W. J. Su, X. Z. Zhu, Y. Cao, B. Li, L. W. Lu, F. R. Wei, J. F. Dai. VL-BERT: Pre-training of generic visual-linguistic representations. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 2020."},{"key":"1394_CR26","doi-asserted-by":"publisher","first-page":"212","DOI":"10.1007\/978-3-030-01225-0_13","volume-title":"Proceedings of the 15th European Conference on Computer Vision","author":"K H Lee","year":"2018","unstructured":"K. H. Lee, X. Chen, G. Hua, H. D. Hu, X. D. He. Stacked cross attention for image-text matching. In Proceedings of the 15th European Conference on Computer Vision, Springer, Munich, Germany, pp. 212\u2013228, 2018. DOI: https:\/\/doi.org\/10.1007\/978-3-030-01225-0_13."},{"key":"1394_CR27","doi-asserted-by":"publisher","first-page":"1899","DOI":"10.1109\/ICCV.2017.208","volume-title":"Proceedings of IEEE International Conference on Computer Vision","author":"Z X Niu","year":"2017","unstructured":"Z. X. Niu, M. Zhou, L. Wang, X. B. Gao, G. Hua. Hierarchical multimodal LSTM for dense visual-semantic embedding. In Proceedings of IEEE International Conference on Computer Vision, IEEE, Venice, Italy, pp. 1899\u20131907, 2017. DOI: https:\/\/doi.org\/10.1109\/ICCV.2017.208."},{"key":"1394_CR28","unstructured":"J. Xia, M. Zhuge, T. Geng, S. Fan, Y. Wei, Z. He, F. Zheng. 
Skating-mixer: Multimodal MLP for scoring figure skating, [Online], Available: https:\/\/arxiv.org\/abs\/2203.03990, 2022."},{"key":"1394_CR29","doi-asserted-by":"publisher","first-page":"121","DOI":"10.1007\/978-3-030-58577-8_8","volume-title":"Proceedings of the 16th European Conference on Computer Vision","author":"X J Li","year":"2020","unstructured":"X. J. Li, X. Yin, C. Y. Li, P. C. Zhang, X. W. Hu, L. Zhang, L. J. Wang, H. D. Hu, L. Dong, F. R. Wei, Y. J. Choi, J. F. Gao. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 121\u2013137, 2020. DOI: https:\/\/doi.org\/10.1007\/978-3-030-58577-8_8."},{"key":"1394_CR30","doi-asserted-by":"publisher","unstructured":"M. C. Zhuge, D. P. Fan, N. Liu, D. W. Zhang, D. Xu, L. Shao. Salient object detection via integrity learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, to be published. DOI: https:\/\/doi.org\/10.1109\/TPAMI.2022.3179526.","DOI":"10.1109\/TPAMI.2022.3179526"},{"key":"1394_CR31","unstructured":"K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, pp. 2048\u20132057, 2015."},{"key":"1394_CR32","doi-asserted-by":"crossref","unstructured":"T. Arici, M. S. Seyfioglu, T. Neiman, Y. Xu, S. Train, T. Chilimbi, B. Zeng, I. Tutar. MLIM: Vision-and-language model pre-training with masked language and image modeling, [Online], Available: https:\/\/arxiv.org\/abs\/2109.12178, September 24, 2021.","DOI":"10.31219\/osf.io\/tqy7r"},{"key":"1394_CR33","unstructured":"H. B. Bao, L. Dong, S. L. Piao, F. R. Wei. BEiT: BERT pre-training of image transformers. 
In Proceedings of the 10th International Conference on Learning Representations, 2022."},{"key":"1394_CR34","doi-asserted-by":"publisher","first-page":"15979","DOI":"10.1109\/CVPR52688.2022.01553","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"K M He","year":"2022","unstructured":"K. M. He, X. L. Chen, S. N. Xie, Y. H. Li, P. Doll\u00e1r, R. Girshick. Masked autoencoders are scalable vision learners. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 15979\u201315988, 2022. DOI: https:\/\/doi.org\/10.1109\/CVPR52688.2022.01553."},{"key":"1394_CR35","unstructured":"Z. C. Huang, Z. Y. Zeng, B. Liu, D. M. Fu, J. L. Fu. Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers, [Online], Available: https:\/\/arxiv.org\/abs\/2004.00849, June 22, 2020."},{"key":"1394_CR36","doi-asserted-by":"publisher","first-page":"7001","DOI":"10.1109\/CVPR46437.2021.00693","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"X D Lin","year":"2021","unstructured":"X. D. Lin, G. Bertasius, J. Wang, S. F. Chang, D. Parikh, L. Torresani. VX2TEXT: End-to-end learning of video-based text generation from multimodal inputs. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 7001\u20137011, 2021. DOI: https:\/\/doi.org\/10.1109\/CVPR46437.2021.00693."},{"key":"1394_CR37","unstructured":"W. Kim, B. Son, I. Kim. ViLT: Vision-and-language transformer without convolution or region supervision. In Proceedings of the 38th International Conference on Machine Learning, pp. 5583\u20135594, 2021."},{"key":"1394_CR38","unstructured":"M. Yan, H. Y. Xu, C. L. Li, B. Bi, J. F. Tian, M. Gui, W. Wang. 
Grid-VLP: Revisiting grid features for vision-language pre-training, [Online], Available: https:\/\/arxiv.org\/abs\/2108.09479, August 21, 2021."},{"key":"1394_CR39","doi-asserted-by":"publisher","first-page":"12971","DOI":"10.1109\/CVPR46437.2021.01278","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Z C Huang","year":"2021","unstructured":"Z. C. Huang, Z. Y. Zeng, Y. P. Huang, B. Liu, D. M. Fu, J. L. Fu. Seeing out of the box: End-to-end pre-training for vision-language representation learning. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 12971\u201312980, 2021. DOI: https:\/\/doi.org\/10.1109\/CVPR46437.2021.01278."},{"key":"1394_CR40","doi-asserted-by":"publisher","first-page":"14085","DOI":"10.1109\/CVPR52688.2022.01371","volume-title":"Proceedings on Conference on computer vision and pattern recognition","author":"S Goenka","year":"2022","unstructured":"S. Goenka, Z. H. Zheng, A. Jaiswal, R. Chada, Y. Wu, V. Hedau, P. Natarajan. Fashionvlp: Vision language transformer for fashion retrieval with feedback. In Proceedings on Conference on computer vision and pattern recognition, IEEE, New Orleans, USA, pp. 14085\u201314095, 2022. DOI: https:\/\/doi.org\/10.1109\/CVPR52688.2022.01371."},{"key":"1394_CR41","doi-asserted-by":"publisher","first-page":"7327","DOI":"10.1109\/CVPR46437.2021.00725","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"J Lei","year":"2021","unstructured":"J. Lei, L. J. Li, L. W. Zhou, Z. Gan, T. L. Berg, M. Bansal, J. J. Liu. Less is more: Clipbert for video-and-language learning via sparse sampling. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 7327\u20137337, 2021. DOI: https:\/\/doi.org\/10.1109\/CVPR46437.2021.00725."},{"key":"1394_CR42","unstructured":"H. Y. Xu, M. Yan, C. L. Li, B. Bi, S. F. 
Huang, W. M. Xiao, F. Huang. E2E-VLP: End-to-end vision-language pre-training enhanced by visual learning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pp. 503\u2013513, 2021."},{"key":"1394_CR43","unstructured":"H. Akbari, L. Z. Yuan, R. Qian, W. H. Chuang, S. F. Chang, Y. Cui, B. Q. Gong. VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems, pp. 24206\u201324221, 2021."},{"key":"1394_CR44","doi-asserted-by":"publisher","first-page":"269","DOI":"10.1145\/3298689.3346996","volume-title":"Proceedings of the 13th ACM Conference on Recommender Systems","author":"X Y Yi","year":"2019","unstructured":"X. Y. Yi, J. Yang, L. C. Hong, D. Z. Cheng, L. Heldt, A. Kumthekar, Z. Zhao, L. Wei, E. Chi. Sampling-bias-corrected neural modeling for large corpus item recommendations. In Proceedings of the 13th ACM Conference on Recommender Systems, ACM, Copenhagen, Denmark, pp. 269\u2013277, 2019. DOI: https:\/\/doi.org\/10.1145\/3298689.3346996."},{"key":"1394_CR45","doi-asserted-by":"publisher","first-page":"234","DOI":"10.1007\/978-3-319-24574-4_28","volume-title":"Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention","author":"O Ronneberger","year":"2015","unstructured":"O. Ronneberger, P. Fischer, T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, Munich, Germany, pp. 234\u2013241, 2015. DOI: https:\/\/doi.org\/10.1007\/978-3-319-24574-4_28."},{"key":"1394_CR46","doi-asserted-by":"publisher","unstructured":"C. Alberti, J. Ling, M. Collins, D. Reitter. Fusion of detected objects in text for visual question answering. 
In Proceedings of Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, pp. 2131\u20132140, 2019. DOI: https:\/\/doi.org\/10.18653\/v1\/D19-1219.","DOI":"10.18653\/v1\/D19-1219"},{"key":"1394_CR47","unstructured":"N. Rostamzadeh, S. Hosseini, T. Boquet, W. Stokowiec, Y. Zhang, C. Jauvin, C. Pal. Fashion-gen: The generative fashion dataset and challenge, [Online], Available: https:\/\/arxiv.org\/abs\/1806.08317v1, July 30, 2018."},{"key":"1394_CR48","unstructured":"R. Kiros, R. Salakhutdinov, R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models, [Online], Available: https:\/\/arxiv.org\/abs\/1411.2539, 2014."},{"key":"1394_CR49","unstructured":"F. Faghri, D. J. Fleet, J. R. Kiros, S. Fidler. VSE++: Improving visual-semantic embeddings with hard negatives. In Proceedings of British Machine Vision Conference, Newcastle, UK, 2018."},{"key":"1394_CR50","doi-asserted-by":"crossref","unstructured":"Y. X. Wang, H. Yang, X. M. Qian, L. Ma, J. Lu, B. Li, X. Fan. Position focused attention network for image-text matching. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, Macao, China, pp. 3792\u20133798, 2019.","DOI":"10.24963\/ijcai.2019\/526"},{"key":"1394_CR51","doi-asserted-by":"publisher","first-page":"248","DOI":"10.1109\/CVPR.2009.5206848","volume-title":"Proceedings of IEEE Conference on Computer Vision and Pattern Recognition","author":"J Deng","year":"2009","unstructured":"J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, F. F. Li. ImageNet: A large-scale hierarchical image database. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Miami, USA, pp. 248\u2013255, 2009. DOI: https:\/\/doi.org\/10.1109\/CVPR.2009.5206848."},{"key":"1394_CR52","unstructured":"A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. 
Krueger, I. Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pp. 8748\u20138763, 2021."},{"key":"1394_CR53","doi-asserted-by":"publisher","first-page":"740","DOI":"10.1007\/978-3-319-10602-1_48","volume-title":"Proceedings of the 13th European Conference on Computer Vision","author":"T Y Lin","year":"2014","unstructured":"T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll\u00e1r, C. L. Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the 13th European Conference on Computer Vision, Springer, Z\u00fcrich, Switzerland, pp. 740\u2013755, 2014. DOI: https:\/\/doi.org\/10.1007\/978-3-319-10602-1_48."},{"key":"1394_CR54","doi-asserted-by":"crossref","unstructured":"G. Li, N. Duan, Y. J. Fang, M. Gong, D. Jiang. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, the 32nd Innovative Applications of Artificial Intelligence Conference, the 10th AAAI Symposium on Educational Advances in Artificial Intelligence, New York, USA, pp. 11336\u201311344, 2020.","DOI":"10.1609\/aaai.v34i07.6795"},{"key":"1394_CR55","doi-asserted-by":"publisher","unstructured":"L. Wu, D. Y. Liu, X. J. Guo, R. C. Hong, L. C. Liu, R. Zhang. Multi-scale spatial representation learning via recursive hermite polynomial networks. In Proceedings of the 31st International Joint Conference on Artificial Intelligence, Vienna, Austria, pp. 1465\u20131473, 2022. DOI: https:\/\/doi.org\/10.24963\/ijcai.2022\/204.","DOI":"10.24963\/ijcai.2022\/204"},{"key":"1394_CR56","doi-asserted-by":"publisher","first-page":"3291","DOI":"10.1145\/3503161.3548195","volume-title":"Proceedings of the 30th ACM International Conference on Multimedia","author":"D P Chen","year":"2022","unstructured":"D. P. Chen, M. Wang, H. B. Chen, L. Wu, J. Qin, W. Peng. 
Cross-modal retrieval with heterogeneous graph embedding. In Proceedings of the 30th ACM International Conference on Multimedia, ACM, Lisboa, Portugal, pp. 3291\u20133300, 2022. DOI: https:\/\/doi.org\/10.1145\/3503161.3548195."},{"key":"1394_CR57","doi-asserted-by":"publisher","unstructured":"D. Y. Liu, L. Wu, F. Zheng, L. Q. Liu, M. Wang. Verbal-person nets: Pose-guided multi-granularity language-to-person generation. IEEE Transactions on Neural Networks and Learning Systems, to be published. DOI: https:\/\/doi.org\/10.1109\/TNNLS.2022.3151631.","DOI":"10.1109\/TNNLS.2022.3151631"},{"key":"1394_CR58","doi-asserted-by":"publisher","unstructured":"Z. Zhang, H. Y. Luo, L. Zhu, G. M. Lu, H. T. Shen. Modality-invariant asymmetric networks for cross-modal hashing. IEEE Transactions on Knowledge and Data Engineering, to be published. DOI: https:\/\/doi.org\/10.1109\/TKDE.2022.3144352.","DOI":"10.1109\/TKDE.2022.3144352"}],"container-title":["Machine Intelligence Research"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11633-022-1394-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11633-022-1394-4\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11633-022-1394-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,7,15]],"date-time":"2023-07-15T02:04:13Z","timestamp":1689386653000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11633-022-1394-4"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,2,27]]},"references-count":58,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2023,6]]}},"alternative-id":["1394"],"URL":"https:\/\/doi.org\/10.1007\/
s11633-022-1394-4","relation":{},"ISSN":["2731-538X","2731-5398"],"issn-type":[{"type":"print","value":"2731-538X"},{"type":"electronic","value":"2731-5398"}],"subject":[],"published":{"date-parts":[[2023,2,27]]},"assertion":[{"value":"24 May 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 October 2022","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"27 February 2023","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}