{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,5]],"date-time":"2026-06-05T15:40:12Z","timestamp":1780674012151,"version":"3.54.1"},"reference-count":34,"publisher":"MDPI AG","issue":"8","license":[{"start":{"date-parts":[[2024,7,29]],"date-time":"2024-07-29T00:00:00Z","timestamp":1722211200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Natural Science Foundation of China","award":["61702234"],"award-info":[{"award-number":["61702234"]}]},{"name":"National Natural Science Foundation of China","award":["25422217"],"award-info":[{"award-number":["25422217"]}]},{"name":"Open Fund for Innovative Research on Ship Overall Performance","award":["61702234"],"award-info":[{"award-number":["61702234"]}]},{"name":"Open Fund for Innovative Research on Ship Overall Performance","award":["25422217"],"award-info":[{"award-number":["25422217"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Symmetry"],"abstract":"<jats:p>Multimodal knowledge graph completion necessitates the integration of information from multiple modalities (such as images and text) into the structural representation of entities to improve link prediction. However, most existing studies have overlooked the interaction between different modalities and the symmetry in the modal fusion process. To address this issue, this paper proposed a Transformer-based knowledge graph link prediction model (MM-Transformer) that fuses multimodal features. Different modal encoders are employed to extract structural, visual, and textual features, and symmetrical hybrid key-value calculations are performed on features from different modalities based on the Transformer architecture. 
The similarities of textual tags to structural tags and visual tags are calculated and aggregated, respectively, and multimodal entity representations are modeled and optimized to reduce the heterogeneity of the representations. The experimental results show that compared with the current multimodal SOTA method, MKGformer, MM-Transformer improves the Hits@1 and Hits@10 evaluation indicators by 1.17% and 1.39%, respectively, proving that the proposed method can effectively solve the problem of multimodal feature fusion in the knowledge graph link prediction task.<\/jats:p>","DOI":"10.3390\/sym16080961","type":"journal-article","created":{"date-parts":[[2024,8,1]],"date-time":"2024-08-01T15:26:53Z","timestamp":1722526013000},"page":"961","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":11,"title":["MM-Transformer: A Transformer-Based Knowledge Graph Link Prediction Model That Fuses Multimodal Features"],"prefix":"10.3390","volume":"16","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7341-2776","authenticated-orcid":false,"given":"Dongsheng","family":"Wang","sequence":"first","affiliation":[{"name":"School of Computer, Jiangsu University of Science and Technology, Zhenjiang 212100, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Kangjie","family":"Tang","sequence":"additional","affiliation":[{"name":"School of Computer, Jiangsu University of Science and Technology, Zhenjiang 212100, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jun","family":"Zeng","sequence":"additional","affiliation":[{"name":"School of Computer, Jiangsu University of Science and Technology, Zhenjiang 212100, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yue","family":"Pan","sequence":"additional","affiliation":[{"name":"School of Computer, Jiangsu University of Science and Technology, Zhenjiang 212100, 
China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yun","family":"Dai","sequence":"additional","affiliation":[{"name":"Department of Information Management, Jiangsu Justice Police Vocational College, Nanjing 211805, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Huige","family":"Li","sequence":"additional","affiliation":[{"name":"School of Computer, Jiangsu University of Science and Technology, Zhenjiang 212100, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Bin","family":"Han","sequence":"additional","affiliation":[{"name":"School of Computer, Jiangsu University of Science and Technology, Zhenjiang 212100, China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1968","published-online":{"date-parts":[[2024,7,29]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Huang, X., Zhang, J., Li, D., and Li, P. (2019, January 11\u201315). Knowledge graph embedding based question answering. Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, Melbourne, VIC, Australia.","DOI":"10.1145\/3289600.3290956"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Yih, S.W., Chang, M.W., He, X., and Gao, J. (2015, January 26). Semantic parsing via staged query graph generation: Question answering with knowledge base. Proceedings of the Joint Conference of the 53rd Annual Meeting of the ACL and the 7th International Joint Conference on Natural Language Processing of the AFNLP, Beijing, China.","DOI":"10.3115\/v1\/P15-1128"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Zhou, H., Young, T., Huang, M., Zhao, H., Xu, J., and Zhu, X. (2018, January 13\u201319). Commonsense knowledge aware conversation generation with graph attention. 
Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence IJCAI, Stockholm, Sweden.","DOI":"10.24963\/ijcai.2018\/643"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Huang, J., Zhao, W.X., Dou, H., Wen, J.R., and Chang, E.Y. (2018, January 8\u201312). Improving sequential recommendation with knowledge-enhanced memory networks. Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, New York, NY, USA.","DOI":"10.1145\/3209978.3210017"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Zhang, N., Jia, Q., Deng, S., Chen, X., Ye, H., Chen, H., Tou, H., Huang, G., Wang, Z., and Hua, N. (2021, January 14\u201318). Alicg: Fine-grained and evolvable conceptual graph construction for semantic search at alibaba. Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore.","DOI":"10.1145\/3447548.3467057"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Dietz, L., Kotov, A., and Meij, E. (2018, January 8\u201312). Utilizing knowledge graphs for text-centric information retrieval. Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Tokyo, Japan.","DOI":"10.1145\/3209978.3210187"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Yang, Z. (2020, January 25\u201330). Biomedical information retrieval incorporating knowledge graph for explainable precision medicine. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event.","DOI":"10.1145\/3397271.3401458"},{"key":"ref_8","first-page":"2787","article-title":"Translating embeddings for modeling multi-relational data","volume":"26","author":"Bordes","year":"2013","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Wang, Z., Zhang, J., Feng, J., and Chen, Z. 
(2014, January 27\u201331). Knowledge graph embedding by translating on hyperplanes. Proceedings of the AAAI Conference on Artificial Intelligence, Qu\u00e9bec City, QC, Canada.","DOI":"10.1609\/aaai.v28i1.8870"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Nathani, D., Chauhan, J., Sharma, C., and Kaul, M. (2019). Learning attention-based embeddings for relation prediction in knowledge graphs. arXiv.","DOI":"10.18653\/v1\/P19-1466"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Nguyen, D.Q., Nguyen, T.D., Nguyen, D.Q., and Phung, D. (2017). A novel embedding model for knowledge base completion based on convolutional neural network. arXiv.","DOI":"10.18653\/v1\/N18-2053"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Pezeshkpour, P., Chen, L., and Singh, S. (2018). Embedding multimodal relational data for knowledge base completion. arXiv.","DOI":"10.18653\/v1\/D18-1359"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Mousselly-Sergieh, H., Botschen, T., Gurevych, I., and Roth, S. (2018, January 5\u20136). A multimodal translation-based approach for knowledge graph representation learning. Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, New Orleans, LA, USA.","DOI":"10.18653\/v1\/S18-2027"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Xie, R., Liu, Z., Luan, H., and Sun, M. (2017, January 19\u201325). Image-embodied knowledge representation learning. 
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Melbourne, VIC, Australia.","DOI":"10.24963\/ijcai.2017\/438"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"63373","DOI":"10.1109\/ACCESS.2019.2916887","article-title":"Deep multimodal representation learning: A survey","volume":"7","author":"Guo","year":"2019","journal-title":"IEEE Access"},{"key":"ref_16","first-page":"13","article-title":"Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks","volume":"32","author":"Lu","year":"2019","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_17","unstructured":"Li, L.H., Yatskar, M., Yin, D., Hsieh, C.J., and Chang, K.W. (2019). Visualbert: A simple and performant baseline for vision and language. arXiv."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Chen, Y.C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng, Y., and Liu, J. (2020, January 23\u201328). Uniter: Universal image-text representation learning. Proceedings of the European Conference on Computer Vision, Glasgow, UK.","DOI":"10.1007\/978-3-030-58577-8_7"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Wang, Z., Li, L., Li, Q., and Zeng, D. (2019, January 14\u201319). Multimodal data enhanced representation learning for knowledge graphs. Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary.","DOI":"10.1109\/IJCNN.2019.8852079"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Zhao, Y., Cai, X., Wu, Y., Zhang, H., Zhang, Y., Zhao, G., and Jiang, N. (2022). MoSE: Modality split and ensemble for multimodal knowledge graph completion. arXiv.","DOI":"10.18653\/v1\/2022.emnlp-main.719"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Wang, M., Wang, S., Yang, H., Zhang, Z., Chen, X., and Qi, G. (2021, January 20\u201324). Is visual context really helpful for knowledge graph? 
A representation learning perspective. Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event.","DOI":"10.1145\/3474085.3475470"},{"key":"ref_22","unstructured":"Shankar, S., Thompson, L., and Fiterau, M. (2022). Progressive fusion for multimodal integration. arXiv."},{"key":"ref_23","unstructured":"Liang, P.P., Ling, C.K., Cheng, Y., Obolenskiy, A., Liu, Y., Pandey, R., and Salakhutdinov, R. (2023, January 1\u20135). Quantifying Interactions in Semi-supervised Multimodal Learning: Guarantees and Applications. Proceedings of the Twelfth International Conference on Learning Representations, Kigali, Rwanda."},{"key":"ref_24","unstructured":"Jiang, Y., Gao, Y., Zhu, Z., Yan, C., and Gao, Y. (2023, September 22). HyperRep: Hypergraph-Based Self-Supervised Multimodal Representation Learning. Available online: https:\/\/openreview.net\/forum?id=y3dqBDnPay."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Golovanevsky, M., Schiller, E., Nair, A.A., Singh, R., and Eickhoff, C. (2024, January 27). One-Versus-Others Attention: Scalable Multimodal Integration for Biomedical Data. Proceedings of the ICML 2024 Workshop on Efficient and Accessible Foundation Models for Biological Discovery, Vienna, Austria.","DOI":"10.1142\/9789819807024_0041"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Zhang, X., Yoon, J., Bansal, M., and Yao, H. (2024, January 17\u201321). Multimodal representation learning by alternating unimodal adaptation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR52733.2024.02592"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Li, X., Zhao, X., Xu, J., Zhang, Y., and Xing, C. (May, January 30). IMF: Interactive multimodal fusion model for link prediction. 
Proceedings of the ACM Web Conference 2023, Austin, TX, USA.","DOI":"10.1145\/3543507.3583554"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Chen, X., Zhang, N., Li, L., Deng, S., Tan, C., Xu, C., Huang, F., Si, L., and Chen, H. (2022, January 11\u201315). Hybrid transformer with multi-level fusion for multimodal knowledge graph completion. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain.","DOI":"10.1145\/3477495.3531992"},{"key":"ref_29","unstructured":"Gu, W., Gao, F., Lou, X., and Zhang, J. (2019). Link prediction via graph attention network. arXiv."},{"key":"ref_30","unstructured":"Alexey, D. (2021, January 3\u20137). An image is worth 16x16 words: Transformers for image recognition at scale. Proceedings of the 9th International Conference on Learning Representations, Virtual Event."},{"key":"ref_31","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2019, January 2\u20137). Bert: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA."},{"key":"ref_32","first-page":"5998","article-title":"Attention is all you need","volume":"30","author":"Vaswani","year":"2017","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"39","DOI":"10.1145\/219717.219748","article-title":"WordNet: A lexical database for English","volume":"38","author":"Miller","year":"1995","journal-title":"Commun. ACM"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Bollacker, K., Evans, C., Paritosh, P., Sturge, T., and Taylor, J. (2008, January 9\u201312). Freebase: A collaboratively created graph database for structuring human knowledge. 
Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Vancouver, BC, Canada.","DOI":"10.1145\/1376616.1376746"}],"container-title":["Symmetry"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-8994\/16\/8\/961\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T15:25:35Z","timestamp":1760109935000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-8994\/16\/8\/961"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7,29]]},"references-count":34,"journal-issue":{"issue":"8","published-online":{"date-parts":[[2024,8]]}},"alternative-id":["sym16080961"],"URL":"https:\/\/doi.org\/10.3390\/sym16080961","relation":{},"ISSN":["2073-8994"],"issn-type":[{"value":"2073-8994","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,7,29]]}}}