{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,27]],"date-time":"2026-06-27T00:54:14Z","timestamp":1782521654144,"version":"3.54.5"},"reference-count":20,"publisher":"MDPI AG","issue":"11","license":[{"start":{"date-parts":[[2024,10,25]],"date-time":"2024-10-25T00:00:00Z","timestamp":1729814400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["J. Imaging"],"abstract":"<jats:p>This study introduces a novel approach for the diagnosis of Cleft Lip and\/or Palate (CL\/P) by integrating Vision Transformers (ViTs) and Siamese Neural Networks. Our study is the first to employ this integration specifically for CL\/P classification, leveraging the strengths of both models to handle complex, multimodal data and few-shot learning scenarios. Unlike previous studies that rely on single-modality data or traditional machine learning models, we uniquely fuse anatomical data from ultrasound images with functional data from speech spectrograms. This multimodal approach captures both structural and acoustic features critical for accurate CL\/P classification. Employing Siamese Neural Networks enables effective learning from a small number of labeled examples, enhancing the model\u2019s generalization capabilities in medical imaging contexts where data scarcity is a significant challenge. The models were tested on the UltraSuite CLEFT dataset, which includes ultrasound video sequences and synchronized speech data, across three cleft types: Bilateral, Unilateral, and Palate-only clefts. The two-stage model demonstrated superior performance in classification accuracy (82.76%), F1-score (80.00\u201386.00%), precision, and recall, particularly distinguishing Bilateral and Unilateral Cleft Lip and Palate with high efficacy. 
This research underscores the significant potential of advanced AI techniques in medical diagnostics, offering valuable insights into their application for improving clinical outcomes in patients with CL\/P.<\/jats:p>","DOI":"10.3390\/jimaging10110271","type":"journal-article","created":{"date-parts":[[2024,10,25]],"date-time":"2024-10-25T03:46:04Z","timestamp":1729827964000},"page":"271","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":10,"title":["Cleft Lip and Palate Classification Through Vision Transformers and Siamese Neural Networks"],"prefix":"10.3390","volume":"10","author":[{"given":"Oraphan","family":"Nantha","sequence":"first","affiliation":[{"name":"School of Information Technology, Sripatum University, Bangkok 10900, Thailand"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Benjaporn","family":"Sathanarugsawait","sequence":"additional","affiliation":[{"name":"School of Information Technology, Sripatum University, Bangkok 10900, Thailand"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Prasong","family":"Praneetpolgrang","sequence":"additional","affiliation":[{"name":"School of Information Technology, Sripatum University, Bangkok 10900, Thailand"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1968","published-online":{"date-parts":[[2024,10,25]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"167","DOI":"10.1038\/nrg2933","article-title":"Cleft lip and palate: Understanding genetic and environmental influences","volume":"12","author":"Dixon","year":"2011","journal-title":"Nat. Rev. 
Genet."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"436","DOI":"10.1038\/nature14539","article-title":"Deep learning","volume":"521","author":"LeCun","year":"2015","journal-title":"Nature"},{"key":"ref_3","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An image is worth 16 \u00d7 16 words: Transformers for image recognition at scale. arXiv."},{"key":"ref_4","unstructured":"Koch, G., Zemel, R., and Salakhutdinov, R. (2015, January 6\u201311). Siamese Neural Networks for One-shot Image Recognition. Proceedings of the ICML Deep Learning Workshop, Lille, France."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Eshky, A., Ribeiro, M.S., Cleland, J., Richmond, K., Roxburgh, Z., Scobbie, J., and Wrench, A. (2019). UltraSuite: A repository of ultrasound and acoustic data from child speech therapy sessions. arXiv.","DOI":"10.21437\/Interspeech.2018-1736"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.ijmedinf.2019.05.023","article-title":"HypernasalityNet: Deep Recurrent Neural Network for Automatic Hypernasality Detection","volume":"129","author":"Wang","year":"2019","journal-title":"Int. J. Med. Inform."},{"key":"ref_7","unstructured":"Maier, A., N\u00f6th, E., Batliner, A., Nkenke, E., and Schuster, M. (2006, January 17\u201321). Fully Automatic Assessment of Speech of Children with Cleft Lip and Palate. Proceedings of the International Conference on Spoken Language Processing, Pittsburgh, PA, USA."},{"key":"ref_8","unstructured":"Zhu, J., Styler, W., and Calloway, I. (2019). A CNN-based tool for automatic tongue contour tracking in ultrasound images. arXiv."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Csap\u00f3, T.G., Gosztolya, G., T\u00f3th, L., Shandiz, A.H., and Mark\u00f3, A. (2022). 
Optimizing the Ultrasound Tongue Image Representation for Residual Network-Based Articulatory-to-Acoustic Mapping. Sensors, 22.","DOI":"10.3390\/s22228601"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Al-hammuri, K., Gebali, F., Thirumarai Chelvan, I., and Kanan, A. (2022). Tongue Contour Tracking and Segmentation in Lingual Ultrasound for Speech Recognition: A Review. Diagnostics, 12.","DOI":"10.3390\/diagnostics12112811"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"68","DOI":"10.1597\/1545-1569_2001_038_0068_dccfaa_2.0.co_2","article-title":"Different Cleft Conditions, Facial Appearance, and Speech: Relationship to Psychological Variables","volume":"38","author":"Millard","year":"2001","journal-title":"Cleft-Palate-Craniofac. J."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"331","DOI":"10.3109\/13682829609031326","article-title":"Characteristics of Cleft Palate Speech","volume":"31","author":"Harding","year":"1996","journal-title":"Int. J. Lang. Commun. Disord."},{"key":"ref_13","unstructured":"Arasteh, S.T., Arias-Vergara, T., P\u00e9rez-Toro, P.A., Weise, T., Packhaeuser, K., Schuster, M., Noeth, E., Maier, A., and Yang, S.H. (2024). The Impact of Speech Anonymization on Pathology and Its Limits. arXiv."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"168","DOI":"10.1016\/j.jbi.2016.10.007","article-title":"Semi-supervised learning of the electronic health record for phenotype stratification","volume":"64","author":"Greene","year":"2016","journal-title":"J. Biomed. Inform."},{"key":"ref_15","first-page":"32","article-title":"Deep learning reinvents the hearing aid","volume":"52","author":"Wang","year":"2015","journal-title":"IEEE Spectr."},{"key":"ref_16","unstructured":"Lu, J., Zhang, Y., and Liu, Z. (2022, January 10\u201314). BiomedCLIP: Contrastive Learning for Biomedical Image and Text Pairs. 
Proceedings of the 30th ACM International Conference on Multimedia, Lisbon, Portugal."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Zhou, Z.H. (2012). Ensemble Methods: Foundations and Algorithms, Chapman and Hall, CRC.","DOI":"10.1201\/b12207"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"2568","DOI":"10.1177\/1460458220911789","article-title":"Cleft prediction before birth using deep neural network","volume":"26","author":"Shafi","year":"2020","journal-title":"Health Inform. J."},{"key":"ref_19","unstructured":"Mamedov, T., and Bluhme, J. (2021). Exploring Deep Learning Approaches to Cleft Lip and Palate Speech: Can Neural Networks Be Used to Accurately Evaluate Velopharyngeal Competence?. [Master\u2019s Thesis, Lund University]."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Kuwada, C., Ariji, Y., Kise, Y., Fujita, H., Katsumata, A., and Ariji, E. (2021). Detection and classification of unilateral cleft alveolus with and without cleft palate on panoramic radiographs using a deep learning system. Sci. Rep., 11.","DOI":"10.1038\/s41598-021-95653-9"}],"container-title":["Journal of Imaging"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2313-433X\/10\/11\/271\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T16:20:13Z","timestamp":1760113213000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2313-433X\/10\/11\/271"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,10,25]]},"references-count":20,"journal-issue":{"issue":"11","published-online":{"date-parts":[[2024,11]]}},"alternative-id":["jimaging10110271"],"URL":"https:\/\/doi.org\/10.3390\/jimaging10110271","relation":{},"ISSN":["2313-433X"],"issn-type":[{"value":"2313-433X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,10,25]]}}}