{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,18]],"date-time":"2025-10-18T21:02:47Z","timestamp":1760821367177,"version":"build-2065373602"},"reference-count":35,"publisher":"MDPI AG","issue":"12","license":[{"start":{"date-parts":[[2022,12,10]],"date-time":"2022-12-10T00:00:00Z","timestamp":1670630400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"ERC","award":["H2020-ERC-2017-ADG 788506","3E210691"],"award-info":[{"award-number":["H2020-ERC-2017-ADG 788506","3E210691"]}]},{"name":"KU Leuven Postdoctoral Mandate","award":["H2020-ERC-2017-ADG 788506","3E210691"],"award-info":[{"award-number":["H2020-ERC-2017-ADG 788506","3E210691"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Computers"],"abstract":"<jats:p>Understanding multimedia content remains a challenging problem in e-commerce search and recommendation applications. It is difficult to obtain item representations that capture the relevant product attributes since these product attributes are fine-grained and scattered across product images with huge visual variations and product descriptions that are noisy and incomplete. In addition, the interpretability and explainability of item representations have become more important in order to make e-commerce applications more intelligible to humans. Multimodal disentangled representation learning, where the independent generative factors of multimodal data are identified and encoded in separate subsets of features in the feature space, is an interesting research area to explore in an e-commerce context given the benefits of the resulting disentangled representations such as generalizability, robustness and interpretability. 
However, the characteristics of real-world e-commerce data, such as the extensive visual variation, noisy and incomplete product descriptions, and complex cross-modal relations of vision and language, together with the lack of an automatic interpretation method to explain the contents of disentangled representations, mean that current approaches for multimodal disentangled representation learning do not suffice for e-commerce data. Therefore, in this work, we design an explainable variational autoencoder framework (E-VAE) which leverages visual and textual item data to obtain disentangled item representations by jointly learning to disentangle the visual item data and to infer a two-level alignment of the visual and textual item data in a multimodal disentangled space. As such, E-VAE tackles the main challenges in disentangling multimodal e-commerce data. Firstly, with the weak supervision of the two-level alignment, our E-VAE learns to steer the disentanglement process towards discovering the relevant factors of variation in the multimodal data and to ignore irrelevant visual variations which are abundant in e-commerce data. Secondly, to the best of our knowledge, our E-VAE is the first VAE-based framework that has an automatic interpretation mechanism that allows the components of the disentangled item representations to be explained with text. With our textual explanations we provide insight into the quality of the disentanglement. 
Furthermore, we demonstrate that with our explainable disentangled item representations we achieve state-of-the-art outfit recommendation results on the Polyvore Outfits dataset and report new state-of-the-art cross-modal search results on the Amazon Dresses dataset.<\/jats:p>","DOI":"10.3390\/computers11120182","type":"journal-article","created":{"date-parts":[[2022,12,12]],"date-time":"2022-12-12T02:15:33Z","timestamp":1670811333000},"page":"182","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["Learning Explainable Disentangled Representations of E-Commerce Data by Aligning Their Visual and Textual Attributes"],"prefix":"10.3390","volume":"11","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-1376-9030","authenticated-orcid":false,"given":"Katrien","family":"Laenen","sequence":"first","affiliation":[{"name":"Human-Computer Interaction group, KU Leuven, 3000 Leuven, Belgium"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3732-9323","authenticated-orcid":false,"given":"Marie-Francine","family":"Moens","sequence":"additional","affiliation":[{"name":"Human-Computer Interaction group, KU Leuven, 3000 Leuven, Belgium"}]}],"member":"1968","published-online":{"date-parts":[[2022,12,10]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"82","DOI":"10.1016\/j.inffus.2019.12.012","article-title":"Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI","volume":"58","author":"Bennetot","year":"2020","journal-title":"Inf. Fusion"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"1798","DOI":"10.1109\/TPAMI.2013.50","article-title":"Representation Learning: A Review and New Perspectives","volume":"35","author":"Bengio","year":"2013","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_3","unstructured":"Kingma, D.P., and Welling, M. (2013). Auto-Encoding Variational Bayes. 
arXiv."},{"key":"ref_4","unstructured":"Rezende, D.J., Mohamed, S., and Wierstra, D. (2014, January 22\u201324). Stochastic Backpropagation and Approximate Inference in Deep Generative Models. Proceedings of the 31st International Conference on Machine Learning, Beijing, China."},{"key":"ref_5","unstructured":"Higgins, I., Matthey, L., Pal, A., Burgess, C.P., Glorot, X., Botvinick, M.M., Mohamed, S., and Lerchner, A. (2017, January 24\u201326). beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. Proceedings of the International Conference on Learning Representations, Toulon, France."},{"key":"ref_6","unstructured":"Kim, H., and Mnih, A. (2018, January 10\u201315). Disentangling by Factorising. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Liu, Z., Luo, P., Wang, X., and Tang, X. (2015, January 7\u201313). Deep Learning Face Attributes in the Wild. Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.425"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Paysan, P., Knothe, R., Amberg, B., Romdhani, S., and Vetter, T. (2009, January 2\u20134). A 3D Face Model for Pose and Illumination Invariant Face Recognition. Proceedings of the 2009 IEEE International Conference on Advanced Video and Signal Based Surveillance, Genova, Italy.","DOI":"10.1109\/AVSS.2009.58"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Aubry, M., Maturana, D., Efros, A.A., Russell, B.C., and Sivic, J. (2014, January 23\u201328). Seeing 3D Chairs: Exemplar Part-Based 2D-3D Alignment Using a Large Dataset of CAD Models. Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.487"},{"key":"ref_10","unstructured":"Yang, Z., Hu, Z., Salakhutdinov, R., and Berg-Kirkpatrick, T. (2017, January 6\u201311). 
Improved Variational Autoencoders for Text Modeling using Dilated Convolutions. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Zhu, Q., Bi, W., Liu, X., Ma, X., Li, X., and Wu, D. (2020). A Batch Normalized Inference Network Keeps the KL Vanishing Away. arXiv.","DOI":"10.18653\/v1\/2020.acl-main.235"},{"key":"ref_12","unstructured":"Wallach, H., Larochelle, H., Beygelzimer, A., d\u2019Alch\u00e9-Buc, F., Fox, E., and Garnett, R. (2019). Learning Disentangled Representations for Recommendation. Advances in Neural Information Processing Systems 32, Curran Associates, Inc."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Hou, Y., Vig, E., Donoser, M., and Bazzani, L. (2021, January 10\u201317). Learning Attribute-Driven Disentangled Representations for Interactive Fashion Retrieval. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.01193"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Bouchacourt, D., Tomioka, R., and Nowozin, S. (2018, January 2\u20137). Multi-Level Variational Autoencoder: Learning Disentangled Representations From Grouped Observations. Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA.","DOI":"10.1609\/aaai.v32i1.11867"},{"key":"ref_15","unstructured":"Locatello, F., Poole, B., Raetsch, G., Sch\u00f6lkopf, B., Bachem, O., and Tschannen, M. (2020, January 13\u201318). Weakly-Supervised Disentanglement Without Compromises. Proceedings of the 37th International Conference on Machine Learning, Virtual."},{"key":"ref_16","unstructured":"Chen, H., Chen, Y., Wang, X., Xie, R., Wang, R., Xia, F., and Zhu, W. (2021, January 6\u201314). Curriculum Disentangled Recommendation with Noisy Multi-feedback. 
Proceedings of the Advances in Neural Information Processing Systems 34 (NeurIPS 2021), Virtual Event."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Zhu, Z., He, Y., and Caverlee, J. (2020, January 22\u201326). Content-Collaborative Disentanglement Representation Learning for Enhanced Recommendation. Proceedings of the Fourteenth ACM Conference on Recommender Systems, Virtual Event.","DOI":"10.1145\/3383313.3412239"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Zhu, Y., and Chen, Z. (2022). Variational Bandwidth Auto-encoder for Hybrid Recommender Systems. IEEE Trans. Knowl. Data Eng., early access.","DOI":"10.1109\/TKDE.2022.3155408"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Theodoridis, T., Chatzis, T., Solachidis, V., Dimitropoulos, K., and Daras, P. (2020, January 14\u201319). Cross-modal Variational Alignment of Latent Spaces. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA.","DOI":"10.1109\/CVPRW50498.2020.00488"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Lee, M., and Pavlovic, V. (2021, January 19\u201325). Private-Shared Disentangled Multimodal VAE for Learning of Latent Representations. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Nashville, TN, USA.","DOI":"10.1109\/CVPRW53098.2021.00185"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Chen, Z., Wang, X., Xie, X., Wu, T., Bu, G., Wang, Y., and Chen, E. (2019, January 10\u201316). Co-Attentive Multi-Task Learning for Explainable Recommendation. Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence. International Joint Conferences on Artificial Intelligence Organization, Macau, China.","DOI":"10.24963\/ijcai.2019\/296"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Truong, Q.T., and Lauw, H. (2019, January 13\u201317). 
Multimodal Review Generation for Recommender Systems. Proceedings of The World Wide Web Conference, San Francisco, CA, USA.","DOI":"10.1145\/3308558.3313463"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Hou, M., Wu, L., Chen, E., Li, Z., Zheng, V.W., and Liu, Q. (2019). Explainable Fashion Recommendation: A Semantic Attribute Region Guided Approach. arXiv.","DOI":"10.24963\/ijcai.2019\/650"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Chen, X., Chen, H., Xu, H., Zhang, Y., Cao, Y., Qin, Z., and Zha, H. (2019, January 21\u201325). Personalized Fashion Recommendation with Visual Explanations Based on Multimodal Attention Network: Towards Visually Explainable Recommendation. Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France.","DOI":"10.1145\/3331184.3331254"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Liu, W., Li, R., Zheng, M., Karanam, S., Wu, Z., Bhanu, B., Radke, R.J., and Camps, O. (2020, January 13\u201319). Towards Visually Explaining Variational Autoencoders. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00867"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Laenen, K., Zoghbi, S., and Moens, M.F. (2018, January 5\u20139). Web Search of Fashion Items with Multimodal Querying. Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, Marina Del Rey, CA, USA.","DOI":"10.1145\/3159652.3159716"},{"key":"ref_28","unstructured":"Mikolov, T., Chen, K., Corrado, G.S., and Dean, J. (2013). 
Efficient Estimation of Word Representations in Vector Space. arXiv."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Yang, Z., He, X., Gao, J., Deng, L., and Smola, A.J. (2016, January 27\u201330). Stacked Attention Networks for Image Question Answering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.10"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Liu, C., Mao, Z., Liu, A.A., Zhang, T., Wang, B., and Zhang, Y. (2019, January 21\u201325). Focus Your Attention: A Bidirectional Focal Attention Network for Image-Text Matching. Proceedings of the 27th ACM International Conference on Multimedia, Nice, France.","DOI":"10.1145\/3343031.3350869"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Vasileva, M.I., Plummer, B.A., Dusad, K., Rajpal, S., Kumar, R., and Forsyth, D.A. (2018, January 8\u201314). Learning Type-Aware Embeddings for Fashion Compatibility. Proceedings of the European Conference on Computer Vision, Munich, Germany.","DOI":"10.1007\/978-3-030-01270-0_24"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"31","DOI":"10.17706\/IJCEE.2016.8.1.31-43","article-title":"Fashion Meets Computer Vision and NLP at e-Commerce Search","volume":"8","author":"Zoghbi","year":"2016","journal-title":"Int. J. Comput. Electr. Eng."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Lee, K.H., Chen, X., Hua, G., Hu, H., and He, X. (2018, January 8\u201314). Stacked Cross Attention for Image-Text Matching. Proceedings of the European Conference on Computer Vision, Munich, Germany.","DOI":"10.1007\/978-3-030-01225-0_13"},{"key":"ref_34","unstructured":"Kingma, D.P., and Ba, J. (2014). Adam: A method for stochastic optimization. The International Conference on Learning Representations. arXiv."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Liu, Z., Luo, P., Qiu, S., Wang, X., and Tang, X. (2016, January 27\u201330). 
DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.124"}],"container-title":["Computers"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-431X\/11\/12\/182\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T01:37:42Z","timestamp":1760146662000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-431X\/11\/12\/182"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,12,10]]},"references-count":35,"journal-issue":{"issue":"12","published-online":{"date-parts":[[2022,12]]}},"alternative-id":["computers11120182"],"URL":"https:\/\/doi.org\/10.3390\/computers11120182","relation":{},"ISSN":["2073-431X"],"issn-type":[{"type":"electronic","value":"2073-431X"}],"subject":[],"published":{"date-parts":[[2022,12,10]]}}}