{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,22]],"date-time":"2026-04-22T19:13:01Z","timestamp":1776885181163,"version":"3.51.2"},"reference-count":27,"publisher":"MDPI AG","issue":"12","license":[{"start":{"date-parts":[[2025,12,15]],"date-time":"2025-12-15T00:00:00Z","timestamp":1765756800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Shanghai Municipal Education Commission Educational Science Planning","award":["C2023035"],"award-info":[{"award-number":["C2023035"]}]},{"name":"Sanda University"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>Recent advancements in multimodal large language models (MLLMs) have demonstrated remarkable capabilities in processing diverse data types, yet significant disparities persist between human cognitive processes and computational approaches to multimodal information integration. This research presents a systematic investigation into the parallels between human cross-modal chunking mechanisms and token representation methodologies in MLLMs. Through empirical studies comparing human performance patterns with model behaviors across visual\u2013linguistic tasks, we demonstrate that conventional static tokenization schemes fundamentally constrain current models\u2019 capacity to simulate the dynamic, context-sensitive nature of human information processing. We propose a novel framework for dynamic cross-modal tokenization that incorporates adaptive boundaries, hierarchical representations, and alignment mechanisms grounded in cognitive science principles. 
Quantitative evaluations demonstrate that our approach yields statistically significant improvements over state-of-the-art models on benchmark tasks (+7.8% on Visual Question Answering (p &lt; 0.001), 5.3% on Complex Scene Description) while exhibiting more human-aligned error patterns and attention distributions. These findings contribute to the theoretical understanding of the relationship between human cognition and artificial intelligence, while providing empirical evidence for developing more cognitively plausible AI systems.<\/jats:p>","DOI":"10.3390\/info16121106","type":"journal-article","created":{"date-parts":[[2025,12,15]],"date-time":"2025-12-15T15:15:08Z","timestamp":1765811708000},"page":"1106","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Adaptive Token Boundaries: Towards Integrating Human Chunking Mechanisms into Multimodal LLMs"],"prefix":"10.3390","volume":"16","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5976-1782","authenticated-orcid":false,"given":"Dongxing","family":"Yu","sequence":"first","affiliation":[{"name":"School of Education, Sanda University, Shanghai 314100, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2025,12,15]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"81","DOI":"10.1037\/h0043158","article-title":"The magical number seven, plus or minus two: Some limits on our capacity for processing information","volume":"63","author":"Miller","year":"1956","journal-title":"Psychol. Rev."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"372","DOI":"10.1037\/0033-2909.124.3.372","article-title":"Eye movements in reading and information processing: 20 years of research","volume":"124","author":"Rayner","year":"1998","journal-title":"Psychol. 
Bull."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"829","DOI":"10.1038\/nrn1201","article-title":"Working memory: Looking back and looking forward","volume":"4","author":"Baddeley","year":"2003","journal-title":"Nat. Rev. Neurosci."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"2425","DOI":"10.1126\/science.1063736","article-title":"Distributed and overlapping representations of faces and objects in ventral temporal cortex","volume":"293","author":"Haxby","year":"2001","journal-title":"Science"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"55","DOI":"10.1016\/0010-0285(73)90004-2","article-title":"Perception in chess","volume":"4","author":"Chase","year":"1973","journal-title":"Cogn. Psychol."},{"key":"ref_6","first-page":"23716","article-title":"Flamingo: A visual language model for few-shot learning","volume":"35","author":"Alayrac","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_7","unstructured":"Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., and Clark, J. (2021, January 18\u201324). Learning transferable visual models from natural language supervision. Proceedings of the International Conference on Machine Learning, Virtual."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"96","DOI":"10.1038\/s42256-024-00963-y","article-title":"Visual cognition in multimodal large language models","volume":"7","author":"Buschoff","year":"2025","journal-title":"Nat. Mach. Intell."},{"key":"ref_9","unstructured":"Li, J., Li, D., Xiong, C., and Hoi, S.C.H. (2022, January 17\u201323). BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. 
Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA."},{"key":"ref_10","first-page":"12449","article-title":"wav2vec 2.0: A framework for self-supervised learning of speech representations","volume":"33","author":"Baevski","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_11","unstructured":"Marcus, G. (2018). Deep learning: A critical appraisal. arXiv."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"E6256","DOI":"10.1073\/pnas.1612132113","article-title":"Neural correlate of the construction of sentence meaning","volume":"113","author":"Fedorenko","year":"2016","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"31","DOI":"10.1016\/j.cortex.2017.02.004","article-title":"Online neural monitoring of statistical learning","volume":"90","author":"Batterink","year":"2017","journal-title":"Cortex"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Sennrich, R., Haddow, B., and Birch, A. (2016, January 7\u201312). Neural machine translation of rare words with subword units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany.","DOI":"10.18653\/v1\/P16-1162"},{"key":"ref_15","unstructured":"Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. (2021, January 18\u201324). Zero-shot text-to-image generation. Proceedings of the International Conference on Machine Learning, Virtual."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"279","DOI":"10.1177\/0963721415570732","article-title":"Words and the world: Predictive coding and the language-perception-cognition interface","volume":"24","author":"Lupyan","year":"2015","journal-title":"Curr. Dir. Psychol. 
Sci."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"e253","DOI":"10.1017\/S0140525X16001837","article-title":"Building machines that learn and think like people","volume":"40","author":"Lake","year":"2017","journal-title":"Behav. Brain Sci."},{"key":"ref_18","unstructured":"Doshi-Velez, F., and Kim, B. (2017). Towards a rigorous science of interpretable machine learning. arXiv."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"1279","DOI":"10.1126\/science.1192788","article-title":"How to grow a mind: Statistics, structure, and abstraction","volume":"331","author":"Tenenbaum","year":"2011","journal-title":"Science"},{"key":"ref_20","unstructured":"Chollet, F. (2019). On the measure of intelligence. arXiv."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"111","DOI":"10.1038\/s41592-018-0235-4","article-title":"fMRIPrep: A robust preprocessing pipeline for functional MRI","volume":"16","author":"Esteban","year":"2019","journal-title":"Nat. Methods"},{"key":"ref_22","unstructured":"Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., and Altman, S. (2023). GPT-4 technical report. arXiv."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., and Parikh, D. (2015, January 7\u201313). VQA: Visual question answering. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.279"},{"key":"ref_24","unstructured":"Chen, X., Fang, H., Lin, T.Y., Vedantam, R., Gupta, S., Doll\u00e1r, P., and Zitnick, C.L. (2015). Microsoft COCO captions: Data collection and evaluation server. arXiv."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Hudson, D.A., and Manning, C.D. (2019, January 16\u201320). GQA: A new dataset for real-world visual reasoning and compositional question answering. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00686"},{"key":"ref_26","unstructured":"Belghazi, M.I., Barber, A., Baez, S., Charlin, L., and Courville, A. (2018, January 10\u201315). Mutual Information Neural Estimation. Proceedings of the International Conference on Machine Learning, Stockholm, Sweden."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Linzen, T. (2020, January 5\u201310). How can we accelerate progress towards human-like linguistic generalization?. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Seattle, WA, USA.","DOI":"10.18653\/v1\/2020.acl-main.465"}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/16\/12\/1106\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,17]],"date-time":"2025-12-17T10:53:35Z","timestamp":1765968815000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/16\/12\/1106"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,15]]},"references-count":27,"journal-issue":{"issue":"12","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["info16121106"],"URL":"https:\/\/doi.org\/10.3390\/info16121106","relation":{},"ISSN":["2078-2489"],"issn-type":[{"value":"2078-2489","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,12,15]]}}}