{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,30]],"date-time":"2026-05-30T06:01:11Z","timestamp":1780120871582,"version":"3.54.0"},"reference-count":37,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2026,4,6]],"date-time":"2026-04-06T00:00:00Z","timestamp":1775433600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2026,4,6]],"date-time":"2026-04-06T00:00:00Z","timestamp":1775433600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001691","name":"the Japan Society for the Promotion of Science","doi-asserted-by":"crossref","award":["J23H04974"],"award-info":[{"award-number":["J23H04974"]}],"id":[{"id":"10.13039\/501100001691","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["New Gener. Comput."],"published-print":{"date-parts":[[2026,5]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>\n                    In recent years, deep generative models for multimodal data have gained significant attention. Among these, multimodal variational autoencoders (VAEs) have emerged as a promising approach, aiming to capture a shared latent representation by integrating information across different modalities through their inference models. A primary challenge for multimodal VAEs is accurately inferring representations from arbitrary subsets of modalities after learning a multimodal inference model. 
Naively, this would require training\n                    <jats:inline-formula>\n                      <jats:alternatives>\n                        <jats:tex-math>$$ 2^M $$<\/jats:tex-math>\n                        <mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                          <mml:msup>\n                            <mml:mn>2<\/mml:mn>\n                            <mml:mi>M<\/mml:mi>\n                          <\/mml:msup>\n                        <\/mml:math>\n                      <\/jats:alternatives>\n                    <\/jats:inline-formula>\n                    different inference networks (\n                    <jats:inline-formula>\n                      <jats:alternatives>\n                        <jats:tex-math>$$M$$<\/jats:tex-math>\n                        <mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                          <mml:mi>M<\/mml:mi>\n                        <\/mml:math>\n                      <\/jats:alternatives>\n                    <\/jats:inline-formula>\n                    is the number of modalities) to handle every possible combination of modalities, which is infeasible for a large number of modalities. Mixture-based models address this challenge by requiring only as many inference models as there are modalities, aggregating unimodal inferences to perform multimodal inference. However, when modalities are missing, these models suffer from information loss, particularly of modality-specific information, leading to deteriorated inference performance. Alternatively, alignment-based multimodal VAEs aim to align unimodal inference models with a multimodal inference model by minimizing the Kullback\u2013Leibler (KL) divergence between them. Yet, the multimodal amortized inference, which is the alignment source in these models, inherently suffers from amortization gaps, preventing it from perfectly approximating the true inference and compromising the accuracy of unimodal inference. 
To address both issues, we introduce an iterative amortized inference mechanism within the multimodal VAE framework, termed multimodal iterative amortized inference. By iteratively refining the multimodal inference using all modalities, this method overcomes the information loss due to missing modalities in mixture-based models and minimizes the amortization gap in alignment-based models. Furthermore, by aligning the unimodal inference to approximate this refined multimodal posterior, we obtain unimodal inferences that effectively incorporate multimodal information while requiring only unimodal inputs at inference time. Experimental results on two benchmark datasets demonstrate that the proposed method improves the performance of the inference itself, as suggested by higher linear classification accuracy and cosine similarity, and that the learned representations effectively capture the distributions of other modalities, as indicated by lower Fr\u00e9chet Inception Distance (FID) scores in cross-modal generation. 
This indicates that the proposed approach significantly enhances the inferred representations from unimodal inputs.\n                  <\/jats:p>","DOI":"10.1007\/s00354-026-00321-z","type":"journal-article","created":{"date-parts":[[2026,4,6]],"date-time":"2026-04-06T15:55:31Z","timestamp":1775490931000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Enhancing Unimodal Latent Representations in Multimodal VAEs Through Iterative Amortized Inference"],"prefix":"10.1007","volume":"44","author":[{"ORCID":"https:\/\/orcid.org\/0009-0006-6016-3866","authenticated-orcid":false,"given":"Yuta","family":"Oshima","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Masahiro","family":"Suzuki","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yutaka","family":"Matsuo","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2026,4,6]]},"reference":[{"key":"321_CR1","doi-asserted-by":"publisher","first-page":"423","DOI":"10.1109\/TPAMI.2018.2798607","volume":"41","author":"T Baltru\u0161aitis","year":"2018","unstructured":"Baltru\u0161aitis, T., Ahuja, C., Morency, L.-P.: Multimodal machine learning: a survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 41, 423\u2013443 (2018)","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"321_CR2","unstructured":"Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013)"},{"key":"321_CR3","unstructured":"Suzuki, M., Nakayama, K., Matsuo, Y.: Joint multimodal learning with deep generative models. 
arXiv preprint arXiv:1611.01891 (2016)"},{"key":"321_CR4","doi-asserted-by":"publisher","first-page":"1019","DOI":"10.1080\/01691864.2022.2035253","volume":"36","author":"M Suzuki","year":"2022","unstructured":"Suzuki, M., Matsuo, Y.: A survey of multimodal deep generative models. Adv. Robot. 36, 1019\u20131026 (2022)","journal-title":"Adv. Robot."},{"key":"321_CR5","unstructured":"Wu, M., Goodman, N.: Multimodal generative models for scalable weakly-supervised learning. In: Advances in Neural Information Processing Systems, pp. 5575\u20135585 (2018)"},{"key":"321_CR6","unstructured":"Shi, Y., Siddharth, N., Paige, B., Torr, P.: Variational mixture-of-experts autoencoders for multi-modal deep generative models. In: Advances in Neural Information Processing Systems, pp. 15718\u201315729 (2019)"},{"key":"321_CR7","unstructured":"Sutter, T.M., Daunhawer, I., Vogt, J.E.: Generalized multimodal ELBO. arXiv preprint arXiv:2105.02470 (2021)"},{"key":"321_CR8","doi-asserted-by":"publisher","first-page":"1771","DOI":"10.1162\/089976602760128018","volume":"14","author":"GE Hinton","year":"2002","unstructured":"Hinton, G.E.: Training products of experts by minimizing contrastive divergence. Neural Comput. 14, 1771\u20131800 (2002)","journal-title":"Neural Comput."},{"key":"321_CR9","unstructured":"Daunhawer, I., Sutter, T.M., Chin-Cheong, K., Palumbo, E., Vogt, J.E.: On the limitations of multimodal VAEs. arXiv preprint arXiv:2110.04121 (2021)"},{"key":"321_CR10","first-page":"12194","volume":"34","author":"H Hwang","year":"2021","unstructured":"Hwang, H., et al.: Multi-view representation learning via total correlation objective. Adv. Neural. Inf. Process. Syst. 34, 12194\u201312207 (2021)","journal-title":"Adv. Neural. Inf. Process. Syst."},{"key":"321_CR11","unstructured":"Cremer, C., Li, X., Duvenaud, D.: Inference suboptimality in variational autoencoders. 
In: International Conference on Machine Learning (2018)"},{"key":"321_CR12","unstructured":"Sutter, T.M., Daunhawer, I., Vogt, J.E.: Multimodal generative learning utilizing Jensen\u2013Shannon divergence. arXiv preprint arXiv:2006.08242 (2020)"},{"key":"321_CR13","unstructured":"Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The Caltech-UCSD Birds-200-2011 dataset. Tech. Rep, California Institute of Technology (2011)"},{"key":"321_CR14","unstructured":"Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in Neural Information Processing Systems, vol. 30 (2017)"},{"key":"321_CR15","unstructured":"Greff, K., et\u00a0al.: Multi-object representation learning with iterative variational inference. In: International Conference on Machine Learning (2019)"},{"key":"321_CR16","unstructured":"Vedantam, R., Fischer, I., Huang, J., Murphy, K.: Generative models of visually grounded imagination. In: International Conference on Learning Representations (2018)"},{"key":"321_CR17","doi-asserted-by":"crossref","unstructured":"Korthals, T., Rudolph, D., Leitner, J., Hesse, M., R\u00fcckert, U.: Multi-modal generative models for learning epistemic active sensing. In: 2019 International Conference on Robotics and Automation (ICRA), pp. 3319\u20133325 (2019)","DOI":"10.1109\/ICRA.2019.8794458"},{"key":"321_CR18","unstructured":"Wu, M., Goodman, N.: Multimodal generative models for compositional representation learning. arXiv preprint arXiv:1912.05075 (2019)"},{"key":"321_CR19","unstructured":"Tsai, Y.-H.H., Liang, P.P., Zadeh, A., Morency, L.-P., Salakhutdinov, R.: Learning factorized multimodal representations. arXiv preprint arXiv:1806.06176 (2018)"},{"key":"321_CR20","unstructured":"Hsu, W.-N., Glass, J.: Disentangling by partitioning: A representation learning framework for multimodal sensory data. 
arXiv preprint arXiv:1805.11264 (2018)"},{"key":"321_CR21","doi-asserted-by":"crossref","unstructured":"Lee, M., Pavlovic, V.: Private-shared disentangled multimodal VAE for learning of hybrid latent representations. arXiv preprint arXiv:2012.13024 (2020)","DOI":"10.1109\/CVPRW53098.2021.00185"},{"key":"321_CR22","first-page":"459","volume":"12544","author":"I Daunhawer","year":"2021","unstructured":"Daunhawer, I., Sutter, T.M., Marcinkevi\u010ds, R., Vogt, J.E.: Self-supervised disentanglement of modality-specific and shared factors improves multimodal generative models. Pattern Recognit. 12544, 459 (2021)","journal-title":"Pattern Recognit."},{"key":"321_CR23","unstructured":"Palumbo, E., Daunhawer, I., Vogt, J.E.: MMVAE+: enhancing the generative quality of multimodal VAEs without compromises. In: Fifth Symposium on Advances in Approximate Bayesian Inference-Fast Track (2023)"},{"key":"321_CR24","unstructured":"Sutter, T.M., Vogt, J.E.: Multimodal relational VAE. arXiv preprint (2021)"},{"key":"321_CR25","unstructured":"Wolff, J., et\u00a0al.: Hierarchical multimodal variational autoencoders. arXiv preprint (2021)"},{"key":"321_CR26","doi-asserted-by":"publisher","first-page":"238","DOI":"10.1016\/j.neunet.2021.11.019","volume":"146","author":"M Vasco","year":"2022","unstructured":"Vasco, M., Yin, H., Melo, F.S., Paiva, A.: Leveraging hierarchy in multimodal generative models for effective cross-modality inference. Neural Netw. 146, 238\u2013255 (2022)","journal-title":"Neural Netw."},{"key":"321_CR27","unstructured":"Vaswani, A., et\u00a0al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)"},{"key":"321_CR28","unstructured":"Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with clip latents. 
arXiv preprint arXiv:2204.06125 (2022)"},{"key":"321_CR29","first-page":"36479","volume":"35","author":"C Saharia","year":"2022","unstructured":"Saharia, C., et al.: Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural. Inf. Process. Syst. 35, 36479\u201336494 (2022)","journal-title":"Adv. Neural. Inf. Process. Syst."},{"key":"321_CR30","unstructured":"Radford, A., et\u00a0al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748\u20138763 (2021)"},{"key":"321_CR31","doi-asserted-by":"crossref","unstructured":"Bachmann, R., Mizrahi, D., Atanov, A., Zamir, A.: MultiMAE: multi-modal multi-task masked autoencoders. arXiv preprint arXiv:2204.01678 (2022)","DOI":"10.1007\/978-3-031-19836-6_20"},{"key":"321_CR32","unstructured":"Dosovitskiy, A., et\u00a0al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)"},{"key":"321_CR33","unstructured":"Marino, J., Yue, Y., Mandt, S.: Iterative amortized inference. In: International Conference on Machine Learning, pp. 3403\u20133412 (2018)"},{"key":"321_CR34","unstructured":"Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)"},{"key":"321_CR35","unstructured":"Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in neural information processing systems, vol. 30 (2017)"},{"key":"321_CR36","unstructured":"Clevert, D.-A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289 (2015)"},{"key":"321_CR37","unstructured":"Ba, J.L., Kiros, J.R. Hinton, G.E.: Layer normalization. 
arXiv preprint arXiv:1607.06450 (2016)"}],"container-title":["New Generation Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00354-026-00321-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s00354-026-00321-z","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00354-026-00321-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,5,30]],"date-time":"2026-05-30T05:12:48Z","timestamp":1780117968000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s00354-026-00321-z"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,4,6]]},"references-count":37,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2026,5]]}},"alternative-id":["321"],"URL":"https:\/\/doi.org\/10.1007\/s00354-026-00321-z","relation":{},"ISSN":["0288-3635","1882-7055"],"issn-type":[{"value":"0288-3635","type":"print"},{"value":"1882-7055","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,4,6]]},"assertion":[{"value":"3 October 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"16 March 2026","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"6 April 2026","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no conflict of interest relevant to the contents of this article.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of 
interest"}}],"article-number":"17"}}