{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,2]],"date-time":"2026-05-02T15:26:12Z","timestamp":1777735572838,"version":"3.51.4"},"reference-count":58,"publisher":"MDPI AG","issue":"11","license":[{"start":{"date-parts":[[2025,11,13]],"date-time":"2025-11-13T00:00:00Z","timestamp":1762992000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["J. Imaging"],"abstract":"<jats:p>Long-tailed image classification remains challenging for vision\u2013language models. Head classes dominate training while tail classes are underrepresented and noisy, and short prompts with weak text supervision further amplify head bias. This paper presents TASA, an end-to-end framework that stabilizes textual supervision and enhances cross-modal fusion. A Semantic Distribution Modulation (SDM) module constructs class-specific text prototypes by cosine-weighted fusion of multiple LLM-generated descriptions with a canonical template, providing stable and diverse semantic anchors without training text parameters. Dual-Space Cross-Modal Fusion (DCF) module incorporates selective-scan state\u2013space blocks into both image and text branches, enabling bidirectional conditioning and efficient feature fusion through a lightweight multilayer perceptron. Together with a margin-aware alignment loss, TASA aligns images with class prototypes for classification without requiring paired image\u2013text data or per-class prompt tuning. Experiments on CIFAR-10\/100-LT, ImageNet-LT, and Places-LT demonstrate consistent improvements across many-, medium-, and few-shot groups. Ablation studies confirm that DCF yields the largest single-module gain, while SDM and DCF combined provide the most robust and balanced performance. 
These results highlight the effectiveness of integrating text-driven prototypes with state\u2013space fusion for long-tailed classification.<\/jats:p>","DOI":"10.3390\/jimaging11110410","type":"journal-article","created":{"date-parts":[[2025,11,13]],"date-time":"2025-11-13T09:59:09Z","timestamp":1763027949000},"page":"410","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["TASA: Text-Anchored State\u2013Space Alignment for Long-Tailed Image Classification"],"prefix":"10.3390","volume":"11","author":[{"ORCID":"https:\/\/orcid.org\/0009-0007-1507-8504","authenticated-orcid":false,"given":"Long","family":"Li","sequence":"first","affiliation":[{"name":"School of Information Engineering, Chang\u2019an University, Xi\u2019an 710064, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-5723-6184","authenticated-orcid":false,"given":"Tinglei","family":"Jia","sequence":"additional","affiliation":[{"name":"School of Information Engineering, Chang\u2019an University, Xi\u2019an 710064, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-0570-3134","authenticated-orcid":false,"given":"Huaizhi","family":"Yue","sequence":"additional","affiliation":[{"name":"School of Information Engineering, Chang\u2019an University, Xi\u2019an 710064, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-3396-8468","authenticated-orcid":false,"given":"Huize","family":"Cheng","sequence":"additional","affiliation":[{"name":"School of Information Engineering, Chang\u2019an University, Xi\u2019an 710064, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-8898-4308","authenticated-orcid":false,"given":"Yongfeng","family":"Bu","sequence":"additional","affiliation":[{"name":"School of Information Engineering, Chang\u2019an University, 
Xi\u2019an 710064, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2287-2486","authenticated-orcid":false,"given":"Zhaoyang","family":"Zhang","sequence":"additional","affiliation":[{"name":"School of Information Engineering, Chang\u2019an University, Xi\u2019an 710064, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2025,11,13]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"1837","DOI":"10.1007\/s11263-022-01622-8","article-title":"A survey on long-tailed visual recognition","volume":"130","author":"Yang","year":"2022","journal-title":"Int. J. Comput. Vis."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"10795","DOI":"10.1109\/TPAMI.2023.3268118","article-title":"Deep long-tailed learning: A survey","volume":"45","author":"Zhang","year":"2023","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Liu, Z., Miao, Z., Zhan, X., Wang, J., Gong, B., and Yu, S.X. (2019, January 15\u201320). Large-scale long-tailed recognition in an open world. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00264"},{"key":"ref_4","first-page":"1567","article-title":"Learning imbalanced datasets with label-distribution-aware margin loss","volume":"32","author":"Cao","year":"2019","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"5890","DOI":"10.1109\/TPAMI.2024.3369102","article-title":"Probabilistic contrastive learning for long-tailed visual recognition","volume":"46","author":"Du","year":"2024","journal-title":"IEEE Trans. Pattern Anal. Mach. 
Intell."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"113288","DOI":"10.1016\/j.knosys.2025.113288","article-title":"Knowledge-based natural answer generation via effective graph learning","volume":"316","author":"Liu","year":"2025","journal-title":"Knowl.-Based Syst."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"2493","DOI":"10.1007\/s11263-024-01983-2","article-title":"Geometric prior guided feature representation learning for long-tailed classification","volume":"132","author":"Ma","year":"2024","journal-title":"Int. J. Comput. Vis."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"13876","DOI":"10.1109\/TPAMI.2023.3298433","article-title":"The equalization losses: Gradient-driven training for long-tailed object recognition","volume":"45","author":"Tan","year":"2023","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"13670","DOI":"10.1109\/TNNLS.2025.3539314","article-title":"A systematic review on long-tailed learning","volume":"36","author":"Zhang","year":"2025","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_10","first-page":"75669","article-title":"How re-sampling helps for long-tail learning?","volume":"36","author":"Shi","year":"2023","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_11","unstructured":"Dong, B., Zhou, P., Yan, S., and Zuo, W. (2022). Lpt: Long-tailed prompt tuning for image classification. arXiv."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Wang, P., Han, K., Wei, X.S., Zhang, L., and Wang, L. (2021, January 20\u201325). Contrastive learning based hybrid networks for long-tailed image classification. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00100"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Zhu, J., Wang, Z., Chen, J., Chen, Y.P.P., and Jiang, Y.G. 
(2022, January 18\u201324). Balanced contrastive learning for long-tailed visual recognition. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00678"},{"key":"ref_14","unstructured":"Xiao, Z., Chen, Z., Liu, L., Feng, Y., Wu, J., Liu, W., Zhou, J.T., Yang, H.H., and Liu, Z. (2024). Fedloge: Joint local and generic federated learning under long-tailed data. arXiv."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Wang, X., Yang, X., Yin, J., Wei, K., and Deng, C. (2024, January 16\u201322). Long-tail class incremental learning via independent sub-prototype construction. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR52733.2024.02702"},{"key":"ref_16","unstructured":"Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., and Clark, J. (2021, January 18\u201324). Learning transferable visual models from natural language supervision. Proceedings of the International Conference on Machine Learning, PMLR, Virtual."},{"key":"ref_17","first-page":"49250","article-title":"Instructblip: Towards general-purpose vision-language models with instruction tuning","volume":"36","author":"Dai","year":"2023","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_18","unstructured":"Tschannen, M., Gritsenko, A., Wang, X., Naeem, M.F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., and Mustafa, B. (2025). Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv."},{"key":"ref_19","first-page":"76298","article-title":"Locoop: Few-shot out-of-distribution detection via prompt learning","volume":"36","author":"Miyai","year":"2023","journal-title":"Adv. Neural Inf. Process. 
Syst."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Lafon, M., Ramzi, E., Rambour, C., Audebert, N., and Thome, N. (2024). Gallop: Learning global and local prompts for vision-language models. Lecture Notes in Computer Science, Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September\u20134 October 2024, Springer.","DOI":"10.1007\/978-3-031-73030-6_15"},{"key":"ref_21","unstructured":"Krizhevsky, A. (2009). Learning Multiple Layers of Features from Tiny Images, University of Toronto. Available online: https:\/\/www.cs.toronto.edu\/~kriz\/learning-features-2009-TR.pdf."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Zhong, Z., Cui, J., Liu, S., and Jia, J. (2021, January 20\u201325). Improving calibration for long-tailed recognition. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.01622"},{"key":"ref_23","first-page":"4175","article-title":"Balanced meta-softmax for long-tailed visual recognition","volume":"33","author":"Ren","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_24","unstructured":"Kandpal, N., Deng, H., Roberts, A., Wallace, E., and Raffel, C. (2023, January 23\u201329). Large language models struggle to learn long-tail knowledge. Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"64915","DOI":"10.52202\/079017-2072","article-title":"Llm-autoda: Large language model-driven automatic data augmentation for long-tailed problems","volume":"37","author":"Wang","year":"2024","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., and Le, Q.V. (2019, January 15\u201320). Autoaugment: Learning augmentation strategies from data. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00020"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"2337","DOI":"10.1007\/s11263-022-01653-1","article-title":"Learning to prompt for vision-language models","volume":"130","author":"Zhou","year":"2022","journal-title":"Int. J. Comput. Vis."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"5092","DOI":"10.1109\/TPAMI.2024.3361862","article-title":"Towards open vocabulary learning: A survey","volume":"46","author":"Wu","year":"2024","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_29","first-page":"12569","article-title":"Prompt pre-training with twenty-thousand classes for open-vocabulary visual recognition","volume":"36","author":"Ren","year":"2023","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"26701","DOI":"10.52202\/079017-0839","article-title":"Llm-esr: Large language models enhancement for long-tailed sequential recommendation","volume":"37","author":"Liu","year":"2024","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Pratt, S., Covert, I., Liu, R., and Farhadi, A. (2023, January 2\u20136). What does a platypus look like? generating customized prompts for zero-shot image classification. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Paris, France.","DOI":"10.1109\/ICCV51070.2023.01438"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Mirza, M.J., Karlinsky, L., Lin, W., Doveh, S., Micorek, J., Kozinski, M., Kuehne, H., and Possegger, H. (2024). Meta-prompting for automating zero-shot visual recognition with llms. 
Lecture Notes in Computer Science, Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September\u20134 October 2024, Springer.","DOI":"10.1007\/978-3-031-72627-9_21"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Wei, H., Yang, Y., Sun, S., Feng, M., Song, X., Lei, Q., Hu, H., Wang, R., Song, H., and Akhtar, N. (2025, January 10\u201317). Mono3DVLT: Monocular-Video-Based 3D Visual Language Tracking. Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA.","DOI":"10.1109\/CVPR52734.2025.01296"},{"key":"ref_34","unstructured":"Li, J., Li, D., Xiong, C., and Hoi, S. (2022, January 17\u201323). Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA."},{"key":"ref_35","unstructured":"Hu, S., Tu, Y., Han, X., He, C., Cui, G., Long, X., Zheng, Z., Fang, Y., Huang, Y., and Zhao, W. (2024). Minicpm: Unveiling the potential of small language models with scalable training strategies. arXiv."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"18792","DOI":"10.1109\/TNNLS.2025.3577292","article-title":"SFAN: Selective Filter and Alignment Network for Cross-Modal Retrieval","volume":"36","author":"Huang","year":"2025","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Girdhar, R., El-Nouby, A., Liu, Z., Singh, M., Alwala, K.V., Joulin, A., and Misra, I. (2023, January 17\u201324). Imagebind: One embedding space to bind them all. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.01457"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Jiang, Z., Meng, R., Yang, X., Yavuz, S., Zhou, Y., and Chen, W. (2024). Vlm2vec: Training vision-language models for massive multimodal embedding tasks. 
arXiv.","DOI":"10.36227\/techrxiv.175624545.56457516\/v2"},{"key":"ref_39","unstructured":"Kang, B., Xie, S., Rohrbach, M., Yan, Z., Gordo, A., Feng, J., and Kalantidis, Y. (2019). Decoupling representation and classifier for long-tailed recognition. arXiv."},{"key":"ref_40","unstructured":"Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., and Anadkat, S. (2023). Gpt-4 technical report. arXiv."},{"key":"ref_41","unstructured":"Wang, Y., Zhang, B., Hou, W., Wu, Z., Wang, J., and Shinozaki, T. (2023, January 11\u201314). Margin calibration for long-tailed visual recognition. Proceedings of the Asian Conference on Machine Learning, PMLR, \u0130stanbul, Turkey."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Goyal, P., Girshick, R., He, K., and Doll\u00e1r, P. (2017, January 22\u201329). Focal loss for dense object detection. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.324"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Cui, Y., Jia, M., Lin, T.Y., Song, Y., and Belongie, S. (2019, January 15\u201320). Class-balanced loss based on effective number of samples. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00949"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Zhou, B., Cui, Q., Wei, X.S., and Chen, Z.M. (2020, January 14\u201319). Bbn: Bilateral-branch network with cumulative learning for long-tailed visual recognition. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00974"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Li, S., Gong, K., Liu, C.H., Wang, Y., Qiao, F., and Cheng, X. (2021, January 20\u201325). Metasaug: Meta semantic augmentation for long-tailed visual recognition. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00517"},{"key":"ref_46","first-page":"3695","article-title":"Reslt: Residual learning for long-tailed recognition","volume":"45","author":"Cui","year":"2022","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Chou, H.P., Chang, S.C., Pan, J.Y., Wei, W., and Juan, D.C. (2020). Remix: Rebalanced mixup. Lecture Notes in Computer Science, Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23\u201328 August 2020, Springer.","DOI":"10.1007\/978-3-030-65414-6_9"},{"key":"ref_48","unstructured":"Menon, A.K., Jayasumana, S., Rawat, A.S., Jain, H., Veit, A., and Kumar, S. (2020). Long-tail learning via logit adjustment. arXiv."},{"key":"ref_49","unstructured":"Wang, X., Lian, L., Miao, Z., Liu, Z., and Yu, S.X. (2020). Long-tailed recognition by routing diverse distribution-aware experts. arXiv."},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Samuel, D., and Chechik, G. (2021, January 10\u201317). Distributional robustness loss for long-tail learning. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00936"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Li, M., Cheung, Y.m., and Lu, Y. (2022, January 18\u201324). Long-tailed visual recognition via gaussian clouded logit adjustment. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.36227\/techrxiv.17031920.v1"},{"key":"ref_52","first-page":"34077","article-title":"Self-supervised aggregation of diverse experts for test-agnostic long-tailed recognition","volume":"35","author":"Zhang","year":"2022","journal-title":"Adv. Neural Inf. Process. 
Syst."},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Du, F., Yang, P., Jia, Q., Nan, F., Chen, X., and Yang, Y. (2023, January 18\u201322). Global and local mixture consistency cumulative learning for long-tailed visual recognitions. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.01518"},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Xu, Z., Liu, R., Yang, S., Chai, Z., and Yuan, C. (2023, January 18\u201322). Learning imbalanced data with vision transformers. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.01516"},{"key":"ref_55","doi-asserted-by":"crossref","unstructured":"Jin, Y., Li, M., Lu, Y., Cheung, Y.m., and Wang, H. (2023, January 18\u201322). Long-tailed visual recognition via self-heterogeneous integration with knowledge excavation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.02269"},{"key":"ref_56","unstructured":"Shu, Y., Guo, X., Wu, J., Wang, X., Wang, J., and Long, M. (2023, January 23\u201329). Clipood: Generalizing clip to out-of-distributions. Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA."},{"key":"ref_57","doi-asserted-by":"crossref","first-page":"67","DOI":"10.1162\/tacl_a_00166","article-title":"From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions","volume":"2","author":"Young","year":"2014","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"ref_58","first-page":"2579","article-title":"Visualizing data using t-SNE","volume":"9","author":"Hinton","year":"2008","journal-title":"J. Mach. Learn. 
Res."}],"container-title":["Journal of Imaging"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2313-433X\/11\/11\/410\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,13]],"date-time":"2025-11-13T10:35:32Z","timestamp":1763030132000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2313-433X\/11\/11\/410"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,13]]},"references-count":58,"journal-issue":{"issue":"11","published-online":{"date-parts":[[2025,11]]}},"alternative-id":["jimaging11110410"],"URL":"https:\/\/doi.org\/10.3390\/jimaging11110410","relation":{},"ISSN":["2313-433X"],"issn-type":[{"value":"2313-433X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,11,13]]}}}