{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,14]],"date-time":"2026-04-14T19:56:18Z","timestamp":1776196578361,"version":"3.50.1"},"reference-count":43,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2025,8,1]],"date-time":"2025-08-01T00:00:00Z","timestamp":1754006400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Ministry of Research, Innovation and Digitization","award":["CNCS\/CCCDI-UEFISCDI"],"award-info":[{"award-number":["CNCS\/CCCDI-UEFISCDI"]}]},{"name":"Ministry of Research, Innovation and Digitization","award":["COFUND-CETP-SMART-LEM-1"],"award-info":[{"award-number":["COFUND-CETP-SMART-LEM-1"]}]},{"DOI":"10.13039\/501100006595","name":"PNCDI IV","doi-asserted-by":"publisher","award":["CNCS\/CCCDI-UEFISCDI"],"award-info":[{"award-number":["CNCS\/CCCDI-UEFISCDI"]}],"id":[{"id":"10.13039\/501100006595","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100006595","name":"PNCDI IV","doi-asserted-by":"publisher","award":["COFUND-CETP-SMART-LEM-1"],"award-info":[{"award-number":["COFUND-CETP-SMART-LEM-1"]}],"id":[{"id":"10.13039\/501100006595","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["JTAER"],"abstract":"<jats:p>In this work, the utility of multimodal vision\u2013language models (VLMs) for visual product understanding in e-commerce is investigated, focusing on two complementary models: ColQwen2 (vidore\/colqwen2-v1.0) and ColPali (vidore\/colpali-v1.2-hf). These models are integrated into two architectures and evaluated across various product interpretation tasks, including image-grounded question answering, brand recognition and visual retrieval based on natural language prompts. 
ColQwen2, built on the Qwen2-VL backbone with LoRA-based adapter hot-swapping, demonstrates strong performance, allowing end-to-end image querying and text response synthesis. It excels at identifying attributes such as brand, color or usage based solely on product images and responds fluently to user questions. In contrast, ColPali, which utilizes the PaliGemma backbone, is optimized for explainability. It delivers detailed visual-token alignment maps that reveal how specific regions of an image contribute to retrieval decisions, offering transparency ideal for diagnostics or educational applications. Through comparative experiments using footwear imagery, it is demonstrated that ColQwen2 is highly effective in generating accurate responses to product-related questions, while ColPali provides fine-grained visual explanations that reinforce trust and model accountability.<\/jats:p>","DOI":"10.3390\/jtaer20030191","type":"journal-article","created":{"date-parts":[[2025,8,4]],"date-time":"2025-08-04T10:48:08Z","timestamp":1754304488000},"page":"191","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Transforming Product Discovery and Interpretation Using Vision\u2013Language Models"],"prefix":"10.3390","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9005-5181","authenticated-orcid":false,"given":"Simona-Vasilica","family":"Oprea","sequence":"first","affiliation":[{"name":"Department of Economic and Informatics and Cybernetics, Bucharest University of Economic Studies, No. 6 Piata Romana, 010374 Bucharest, Romania"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0961-352X","authenticated-orcid":false,"given":"Adela","family":"B\u00e2ra","sequence":"additional","affiliation":[{"name":"Department of Economic and Informatics and Cybernetics, Bucharest University of Economic Studies, No. 
6 Piata Romana, 010374 Bucharest, Romania"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2025,8,1]]},"reference":[{"key":"ref_1","unstructured":"Li, H., Yuan, P., Xu, S., Wu, Y., He, X., and Zhou, B. (2020, January 7\u201312). Aspect-Aware Multimodal Summarization for Chinese e-Commerce Products. Proceedings of the AAAI 2020\u201434th AAAI Conference on Artificial Intelligence, New York, NY, USA."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Khandelwal, A., Kulkarni, S.S., Mittal, H., and Gupta, D. (2023, January 9\u201314). Large Scale Generative Multimodal Attribute Extraction for E-Commerce Attributes. Proceedings of the Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada.","DOI":"10.18653\/v1\/2023.acl-industry.29"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"581","DOI":"10.1007\/s11263-023-01891-x","article-title":"CLIP-Adapter: Better Vision-Language Models with Feature Adapters","volume":"132","author":"Gao","year":"2024","journal-title":"Int. J. Comput. Vis."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Tashu, T.M., Fattouh, S., Kiss, P., and Horvath, T. (2022, January 16\u201318). Multimodal E-Commerce Product Classification Using Hierarchical Fusion. Proceedings of the 2022 IEEE 2nd Conference on Information Technology and Data Science, CITDS 2022\u2014Proceedings, Debrecen, Hungary.","DOI":"10.1109\/CITDS54976.2022.9914136"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Hewawalpita, S., and Perera, I. (2019, January 28). Multimodal User Interaction Framework for E-Commerce. Proceedings of the IEEE International Research Conference on Smart Computing and Systems Engineering, SCSE 2019, Colombo, Sri Lanka.","DOI":"10.23919\/SCSE.2019.8842815"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Hendriksen, M. (2022). Multimodal Retrieval in E-Commerce: From Categories to Images, Text, and Back. 
Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Springer.","DOI":"10.1007\/978-3-030-99739-7_62"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"846","DOI":"10.1109\/TVCG.2021.3114781","article-title":"VideoModerator: A Risk-Aware Framework for Multimodal Video Moderation in E-Commerce","volume":"28","author":"Tang","year":"2022","journal-title":"IEEE Trans. Vis. Comput. Graph."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"101847","DOI":"10.1016\/j.inffus.2023.101847","article-title":"Emotion Recognition from Unimodal to Multimodal Analysis: A Review","volume":"99","author":"Ezzameli","year":"2023","journal-title":"Inf. Fusion"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"8323467","DOI":"10.1155\/2022\/8323467","article-title":"Analysis of Precision Service of Agricultural Product E-Commerce Based on Multimodal Collaborative Filtering Algorithm","volume":"2022","author":"Zhuansun","year":"2022","journal-title":"Math. Probl. Eng."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"113984","DOI":"10.1016\/j.dss.2023.113984","article-title":"How Do You Say It Matters? A Multimodal Analytics Framework for Product Return Prediction in Live Streaming e-Commerce","volume":"172","author":"Xu","year":"2023","journal-title":"Decis. Support Syst."},{"key":"ref_11","first-page":"5568208","article-title":"Multimodal Data Guided Spatial Feature Fusion and Grouping Strategy for E-Commerce Commodity Demand Forecasting","volume":"2021","author":"Cai","year":"2021","journal-title":"Mob. Inf. Syst."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"114104","DOI":"10.1016\/j.dss.2023.114104","article-title":"A Multimodal Analytics Framework for Product Sales Prediction with the Reputation of Anchors in Live Streaming E-Commerce","volume":"177","author":"Xu","year":"2024","journal-title":"Decis. 
Support Syst."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"1","DOI":"10.4018\/JGIM.315322","article-title":"The Multimodal Emotion Information Analysis of E-Commerce Online Pricing in Electronic Word of Mouth","volume":"30","author":"Chen","year":"2022","journal-title":"J. Glob. Inf. Manag."},{"key":"ref_14","unstructured":"Koh, J.Y., Salakhutdinov, R., and Fried, D. (2023, January 23\u201329). Grounding Language Models to Images for Multimodal Generation. Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Sun, Q., Wang, Y., Xu, C., Zheng, K., Yang, Y., Hu, H., Xu, F., Zhang, J., Geng, X., and Jiang, D. (2022, January 22\u201327). Multimodal Dialogue Response Generation. Proceedings of the Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland.","DOI":"10.18653\/v1\/2022.acl-long.204"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"9267487","DOI":"10.1155\/2023\/9267487","article-title":"Beyond Words: An Intelligent Human-Machine Dialogue System with Multimodal Generation and Emotional Comprehension","volume":"2023","author":"Zhao","year":"2023","journal-title":"Int. J. Intell. Syst."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., and Wen, J.R. (2023, January 6\u201310). Evaluating Object Hallucination in Large Vision-Language Models. Proceedings of the EMNLP 2023\u20142023 Conference on Empirical Methods in Natural Language Processing, Singapore.","DOI":"10.18653\/v1\/2023.emnlp-main.20"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Ma, H., Fan, B., Ng, B.K., and Lam, C.T. (2024). VL-Meta: Vision-Language Models for Multimodal Meta-Learning. Mathematics, 12.","DOI":"10.3390\/math12020286"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Wu, J., Yu, T., and Li, S. (2021, January 20\u201324). 
Deconfounded and Explainable Interactive Vision-Language Retrieval of Complex Scenes. Proceedings of the MM 2021\u2014Proceedings of the 29th ACM International Conference on Multimedia, Virtual.","DOI":"10.1145\/3474085.3475366"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Sammani, F., Mukherjee, T., and Deligiannis, N. (2022, January 18\u201324). NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00814"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"5625","DOI":"10.1109\/TPAMI.2024.3369699","article-title":"Vision-Language Models for Vision Tasks: A Survey","volume":"46","author":"Zhang","year":"2024","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"392","DOI":"10.1007\/s11263-023-01876-w","article-title":"Transferring Vision-Language Models for Visual Recognition: A Classifier Perspective","volume":"132","author":"Wu","year":"2024","journal-title":"Int. J. Comput. Vis."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"224","DOI":"10.1007\/s11263-023-01868-w","article-title":"Exploring Vision-Language Models for Imbalanced Learning","volume":"132","author":"Wang","year":"2024","journal-title":"Int. J. Comput. Vis."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"2337","DOI":"10.1007\/s11263-022-01653-1","article-title":"Learning to Prompt for Vision-Language Models","volume":"130","author":"Zhou","year":"2022","journal-title":"Int. J. Comput. Vis."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Zhou, K., Yang, J., Loy, C.C., and Liu, Z. (2022, January 18\u201324). Conditional Prompt Learning for Vision-Language Models. 
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01631"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Wang, H., Huang, R., and Zhang, J. (2022). Research Progress on Vision\u2013Language Multimodal Pretraining Model Technology. Electronics, 11.","DOI":"10.3390\/electronics11213556"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"103885","DOI":"10.1016\/j.cag.2024.01.012","article-title":"A Survey of Efficient Fine-Tuning Methods for Vision-Language Models\u2014Prompt and Adapter","volume":"119","author":"Xing","year":"2024","journal-title":"Comput. Graph."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Wang, L., He, J., Li, S., Liu, N., and Lim, E.P. (2024). Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Springer.","DOI":"10.1007\/978-3-031-53302-0_3"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Du, Y., Wei, F., Zhang, Z., Shi, M., Gao, Y., and Li, G. (2022, January 18\u201324). Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01369"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Shin, W., Park, J., Woo, T., Cho, Y., Oh, K., and Song, H. (2022, January 17\u201321). E-CLIP: Large-Scale Vision-Language Representation Learning in E-Commerce. Proceedings of the International Conference on Information and Knowledge Management, Atlanta, GA, USA.","DOI":"10.1145\/3511808.3557067"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Jin, Y., Li, Y., Yuan, Z., and Mu, Y. (2023, January 17\u201324). 
Learning Instance-Level Representation for Large-Scale Multi-Modal Pretraining in E-Commerce. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.01064"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Moratelli, N., Barraco, M., Morelli, D., Cornia, M., Baraldi, L., and Cucchiara, R. (2023). Fashion-Oriented Image Captioning with External Knowledge Retrieval and Fully Attentive Gates. Sensors, 23.","DOI":"10.3390\/s23031286"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Chen, B., Jin, L., Wang, X., Gao, D., Jiang, W., and Ning, W. (May, January 30). Unified Vision-Language Representation Modeling for E-Commerce Same-Style Products Retrieval. Proceedings of the ACM Web Conference 2023\u2014Companion of the World Wide Web Conference, WWW 2023, Austin, TX, USA.","DOI":"10.1145\/3543873.3584632"},{"key":"ref_34","first-page":"5","article-title":"Artificial intelligence: Friend or Foe? Experts\u2019 Concerns on European AI Act","volume":"57","author":"Matei","year":"2023","journal-title":"Econ. Comput. Econ. Cybern. Stud. Res."},{"key":"ref_35","unstructured":"Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., and Clark, J. (2021, January 18\u201324). Learning Transferable Visual Models From Natural Language Supervision. Proceedings of the Machine Learning Research, Online."},{"key":"ref_36","unstructured":"Li, J., Li, D., Savarese, S., and Hoi, S. (2023, January 23\u201329). BLIP-2: Bootstrapping Language-Image Pre-Training with Frozen Image Encoders and Large Language Models. Proceedings of the Machine Learning Research, Honolulu, HI, USA."},{"key":"ref_37","unstructured":"Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., K\u00fcttler, H., Lewis, M., Yih, W.T., and Rockt\u00e4schel, T. (2020, January 6\u201312). 
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Proceedings of the Advances in Neural Information Processing Systems, Virtual."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"336","DOI":"10.1007\/s11263-019-01228-7","article-title":"Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization","volume":"128","author":"Selvaraju","year":"2020","journal-title":"Int. J. Comput. Vis."},{"key":"ref_39","unstructured":"Doshi-Velez, F., and Kim, B. (2017). A Roadmap for a Rigorous Science of Interpretability. arXiv."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Gao, D., Jin, L., Chen, B., Qiu, M., Li, P., Wei, Y., Hu, Y., and Wang, H. (2020, January 25\u201330). FashionBERT: Text and Image Matching with Adaptive Loss for Cross-Modal Retrieval. Proceedings of the SIGIR 2020\u2014Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual.","DOI":"10.1145\/3397271.3401430"},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"1885","DOI":"10.1007\/s11280-021-00913-3","article-title":"Attribute-Aware Explainable Complementary Clothing Recommendation","volume":"24","author":"Li","year":"2021","journal-title":"World Wide Web"},{"key":"ref_42","unstructured":"Kim, W., Son, B., and Kim, I. (2021, January 18\u201324). ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. Proceedings of Machine Learning Research, Online."},{"key":"ref_43","unstructured":"Hu, E., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2022, January 25). LoRA: Low-Rank Adaptation of Large Language Models. 
Proceedings of the ICLR 2022\u201410th International Conference on Learning Representations, Virtual."}],"container-title":["Journal of Theoretical and Applied Electronic Commerce Research"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/0718-1876\/20\/3\/191\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T18:21:11Z","timestamp":1760034071000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/0718-1876\/20\/3\/191"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8,1]]},"references-count":43,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2025,9]]}},"alternative-id":["jtaer20030191"],"URL":"https:\/\/doi.org\/10.3390\/jtaer20030191","relation":{},"ISSN":["0718-1876"],"issn-type":[{"value":"0718-1876","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,8,1]]}}}