{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,24]],"date-time":"2026-04-24T16:42:55Z","timestamp":1777048975732,"version":"3.51.4"},"reference-count":49,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2021,2,9]],"date-time":"2021-02-09T00:00:00Z","timestamp":1612828800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100004919","name":"King Abdulaziz City for Science and Technology","doi-asserted-by":"publisher","award":["3-17-09-001-0003"],"award-info":[{"award-number":["3-17-09-001-0003"]}],"id":[{"id":"10.13039\/501100004919","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>This study proposes using object detection techniques to recognize sequences of articulatory features (AFs) from speech utterances by treating AFs of phonemes as multi-label objects in speech spectrogram. The proposed system, called AFD-Obj, recognizes sequence of multi-label AFs in speech signal and localizes them. AFD-Obj consists of two main stages: firstly, we formulate the problem of AFs detection as an object detection problem and prepare the data to fulfill requirement of object detectors by generating a spectral three-channel image from the speech signal and creating the corresponding annotation for each utterance. Secondly, we use annotated images to train the proposed system to detect sequences of AFs and their boundaries. We test the system by feeding spectrogram images to the system, which will recognize and localize multi-label AFs. We investigated using these AFs to detect the utterance phonemes. YOLOv3-tiny detector is selected because of its real-time property and its support for multi-label detection. 
We test our AFD-Obj system on the Arabic and English languages using the KAPD and TIMIT corpora, respectively. Additionally, we propose using YOLOv3-tiny as an Arabic phoneme detection system (called PD-Obj) to recognize and localize a sequence of Arabic phonemes from whole speech utterances. The proposed AFD-Obj and PD-Obj systems achieve excellent results on the Arabic corpus and results comparable to the state-of-the-art method on the English corpus. Moreover, we show that using only one-scale detection is sufficient for AF detection and phoneme recognition.<\/jats:p>","DOI":"10.3390\/s21041205","type":"journal-article","created":{"date-parts":[[2021,2,10]],"date-time":"2021-02-10T04:33:46Z","timestamp":1612931626000},"page":"1205","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":8,"title":["Deep Learning-Based Detection of Articulatory Features in Arabic and English Speech"],"prefix":"10.3390","volume":"21","author":[{"given":"Mohammed","family":"Algabri","sequence":"first","affiliation":[{"name":"Computer Science Department, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia"},{"name":"Center of Smart Robotics Research (CS2R), College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia"}]},{"given":"Hassan","family":"Mathkour","sequence":"additional","affiliation":[{"name":"Computer Science Department, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia"},{"name":"Center of Smart Robotics Research (CS2R), College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia"}]},{"given":"Mansour M.","family":"Alsulaiman","sequence":"additional","affiliation":[{"name":"Center of Smart Robotics Research (CS2R), College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia"},{"name":"Computer Engineering Department, College of Computer and 
Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8147-8679","authenticated-orcid":false,"given":"Mohamed A.","family":"Bencherif","sequence":"additional","affiliation":[{"name":"Center of Smart Robotics Research (CS2R), College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia"},{"name":"Computer Engineering Department, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia"}]}],"member":"1968","published-online":{"date-parts":[[2021,2,9]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Pulverm\u00fcller, F., and Fadiga, L. (2016). Brain language mechanisms built on action and perception. Neurobiology of Language, Elsevier.","DOI":"10.1016\/B978-0-12-407794-2.00026-2"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1561\/2000000001","article-title":"Introduction to digital speech processing","volume":"1","author":"Rabiner","year":"2007","journal-title":"Found. Trends\u00ae Signal Process."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"81382","DOI":"10.1109\/ACCESS.2019.2924014","article-title":"Distinctive phonetic features modeling and extraction using deep neural networks","volume":"7","author":"Seddiq","year":"2019","journal-title":"IEEE Access"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"29","DOI":"10.1016\/j.specom.2019.06.003","article-title":"Anomaly detection based pronunciation verification approach using speech attribute features","volume":"111","author":"Shahin","year":"2019","journal-title":"Speech Commun."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"3549","DOI":"10.21437\/Interspeech.2019-1677","article-title":"Multimodal articulation-based pronunciation error detection with spectrogram and acoustic features","volume":"2019","author":"Jenne","year":"2019","journal-title":"Proc. 
Interspeech"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Cao, B., Kim, M.J., van Santen, J.P.H., Mau, T., and Wang, J. (2017, January 20\u201324). Integrating articulatory information in deep learning-based text-to-speech synthesis. Proceedings of the Interspeech, Stockholm, Sweden.","DOI":"10.21437\/Interspeech.2017-1762"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Yilmaz, E., Mitra, V., Bartels, C., and Franco, H. (2018). Articulatory features for ASR of pathological speech. arXiv.","DOI":"10.21437\/Interspeech.2018-67"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"1077","DOI":"10.1007\/s11265-018-1334-2","article-title":"Improving Mandarin tone recognition based on DNN by combining acoustic and articulatory features using extended recognition networks","volume":"90","author":"Lin","year":"2018","journal-title":"J. Signal Process. Syst."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"1089","DOI":"10.1109\/JPROC.2013.2238591","article-title":"An information-extraction approach to speech processing: Analysis, detection, verification, and recognition","volume":"101","author":"Lee","year":"2013","journal-title":"Proc. IEEE"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Behravan, H., Hautama, V., Siniscalchi, S.M., Kinnunen, T., and Lee, C.-H. (2014, January 4\u20139). Introducing attribute features to foreign accent recognition. Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy.","DOI":"10.1109\/ICASSP.2014.6854621"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"875","DOI":"10.1109\/TASL.2011.2167610","article-title":"Experiments on cross-language attribute detection and phone recognition with minimal target-specific training data","volume":"20","author":"Siniscalchi","year":"2011","journal-title":"IEEE Trans. Audio. Speech. Lang. 
Process."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Wang, H., Zhao, Y., Xu, Y., Xu, X., Suo, X., and Ji, Q. (2014, January 12\u201314). Cross-language speech attribute detection and phone recognition for Tibetan using deep learning. Proceedings of the the 9th International Symposium on Chinese Spoken Language Processing, Singapore.","DOI":"10.1109\/ISCSLP.2014.6936682"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"54663","DOI":"10.1109\/ACCESS.2020.2980452","article-title":"Towards deep object detection techniques for phoneme recognition","volume":"8","author":"Algabri","year":"2020","journal-title":"IEEE Access"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"1426","DOI":"10.3906\/elk-1112-29","article-title":"Review of distinctive phonetic features and the Arabic share in related modern research","volume":"21","author":"Alotaibi","year":"2013","journal-title":"Turk. J. Electr. Eng. Comput. Sci."},{"key":"ref_15","first-page":"5633","article-title":"Automatic speech attribute detection of arabic language","volume":"13","author":"Aljohani","year":"2018","journal-title":"Int. J. Appl. Eng. Res."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"333","DOI":"10.1006\/csla.2000.0148","article-title":"Detection of phonological features in continuous speech using neural networks","volume":"14","author":"King","year":"2000","journal-title":"Comput. Speech Lang."},{"key":"ref_17","unstructured":"Chomsky, N., and Halle, M. (1968). The Sound Pattern of English, MIT Press."},{"key":"ref_18","unstructured":"Harris, J. (1994). English Sound Structure, Blackwell."},{"key":"ref_19","unstructured":"Hou, J., Rabiner, L., and Dusan, S. (2006, January 14\u201319). Automatic speech attribute transcription (asat)-the front end processor. 
Proceedings of the 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, Toulouse, France."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016, January 27\u201330). You only look once: Unified, real-time object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.91"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"9102","DOI":"10.1109\/ACCESS.2020.2964608","article-title":"Real-time apple detection system using embedded systems with hardware accelerators: An edge AI application","volume":"8","author":"Mazzia","year":"2020","journal-title":"IEEE Access"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Gong, H., Li, H., Xu, K., and Zhang, Y. (2019, January 27\u201330). Object detection based on improved YOLOv3-tiny. Proceedings of the 2019 Chinese Automation Congress (CAC), Auckland, New Zealand.","DOI":"10.1109\/CAC48633.2019.8996750"},{"key":"ref_23","unstructured":"Alexey, A.B. (2020, May 30). Windows and Linux Version of Darknet Yolo v3 & v2 Neural Networks for object detection. GitHub Repos. Available online: https:\/\/github.com\/joheras\/darknet-1."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"52025","DOI":"10.1088\/1755-1315\/440\/5\/052025","article-title":"An improved method of Tiny YOLOV3","volume":"440","author":"Gong","year":"2020","journal-title":"IOP Conf. Ser. Earth Environ. Sci."},{"key":"ref_25","unstructured":"Redmon, J. (2020, May 30). ImageNet Classification. Available online: https:\/\/pjreddie.com\/darknet\/imagenet\/."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Redmon, J., and Farhadi, A. (2017, January 21\u201326). YOLO9000: Better, faster, stronger. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.690"},{"key":"ref_27","unstructured":"Redmon, J., and Farhadi, A. (2018). Yolov3: An incremental improvement. arXiv."},{"key":"ref_28","unstructured":"Redmon, J., and Farhadi, A. (2020, September 06). Yolo: Real-Time Object Detection. Available online: https:\/\/pjreddie.com\/darknet\/yolo."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009, January 20\u201325). Imagenet: A large-scale hierarchical image database. Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA.","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"ref_30","unstructured":"He, K., Girshick, R., and Doll\u00e1r, P. (November, January 27). Rethinking imagenet pre-training. Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea."},{"key":"ref_31","unstructured":"Xie, J., Ding, C., Li, W., and Cai, C. (2018). Audio-only bird species automated identification method with limited training data based on multi-channel deep convolutional neural networks. arXiv."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Amiriparian, S., Gerczuk, M., Ottl, S., Cummins, N., Freitag, M., Pugachevskiy, S., Baird, A., and Schuller, B.W. (2017, January 20\u201324). Snore Sound Classification Using Image-Based Deep Spectrum Features. Proceedings of the INTERSPEECH, Stockholm, Sweden.","DOI":"10.21437\/Interspeech.2017-434"},{"key":"ref_33","unstructured":"Redmon, J. (2018, August 23). Tiny Darknet. Available online: https:\/\/pjreddie.com\/darknet\/tiny-darknet\/."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Segal, Y., Fuchs, T.S., and Keshet, J. (2019). SpeechYOLO: Detection and Localization of Speech Objects. arXiv.","DOI":"10.21437\/Interspeech.2019-1749"},{"key":"ref_35","unstructured":"Warden, P. 
(2018). Speech commands: A dataset for limited-vocabulary speech recognition. arXiv."},{"key":"ref_36","unstructured":"Alghmadi, M. (2003, January 3\u20139). KACST arabic phonetic database. Proceedings of the Fifteenth International Congress of Phonetic Sciences, Barcelona, Spain."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Seddiq, Y., Meftah, A., Alghamdi, M., and Alotaibi, Y. (2016, January 28\u201330). Reintroducing KAPD as a Dataset for Machine Learning and Data Mining Applications. Proceedings of the 2016 European Modelling Symposium (EMS), Pisa, Italy.","DOI":"10.1109\/EMS.2016.022"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Graves, A., Mohamed, A., and Hinton, G. (2013, January 26\u201331). Speech recognition with deep recurrent neural networks. Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada.","DOI":"10.1109\/ICASSP.2013.6638947"},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"82","DOI":"10.1109\/MSP.2012.2205597","article-title":"Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups","volume":"29","author":"Hinton","year":"2012","journal-title":"IEEE Signal Process. Mag."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Karaulov, I., and Tkanov, D. (2019). Attention model for articulatory features detection. arXiv.","DOI":"10.21437\/Interspeech.2019-3020"},{"key":"ref_41","first-page":"27403","article-title":"DARPA TIMIT acoustic-phonetic continous speech corpus CD-ROM. NIST speech disc 1-1.1","volume":"93","author":"Garofolo","year":"1993","journal-title":"NASA STI\/Recon Technol. Rep."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Hwang, M.J., and Kang, H.G. (2019, January 15\u201319). Parameter enhancement for MELP speech codec in noisy communication environment. 
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Graz, Austria.","DOI":"10.21437\/Interspeech.2019-3249"},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"113","DOI":"10.1016\/j.ins.2013.07.007","article-title":"An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics","volume":"250","author":"Garcia","year":"2013","journal-title":"Inf. Sci."},{"key":"ref_44","first-page":"75","article-title":"The HTK book","volume":"3","author":"Young","year":"2006","journal-title":"Camb. Univ. Eng. Dep."},{"key":"ref_45","unstructured":"Ramachandran, P., Zoph, B., and Le, Q.V. (2017). Searching for activation functions. arXiv."},{"key":"ref_46","unstructured":"De Andrade, D.C., Leo, S., Viana, M.L.D.S., and Bernkopf, C. (2018). A neural attention model for speech command recognition. arXiv."},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"285","DOI":"10.1109\/JSTSP.2019.2909479","article-title":"Comparison and analysis of SampleCNN architectures for audio classification","volume":"13","author":"Kim","year":"2019","journal-title":"IEEE J. Sel. Top. Signal Process."},{"key":"ref_48","unstructured":"Kim, T., and Nam, J. (2019). Temporal feedback convolutional recurrent neural networks for keyword spotting. arXiv."},{"key":"ref_49","doi-asserted-by":"crossref","first-page":"1269","DOI":"10.3813\/AAA.919404","article-title":"A canonicalization of distinctive phonetic features to improve arabic speech recognition","volume":"105","author":"Alotaibi","year":"2019","journal-title":"Acta Acust. 
United Acust."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/4\/1205\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T05:21:50Z","timestamp":1760160110000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/4\/1205"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,2,9]]},"references-count":49,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2021,2]]}},"alternative-id":["s21041205"],"URL":"https:\/\/doi.org\/10.3390\/s21041205","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,2,9]]}}}
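The abstract describes preparing speech for an object detector by generating a spectral three-channel image from the waveform. As an illustrative sketch only (not the authors' code), the following shows one plausible way to build such an image: a log-magnitude spectrogram plus its first and second temporal differences as the extra channels. The channel composition, frame length, and hop size are assumptions for illustration.

```python
# Illustrative sketch: turn a 1-D waveform into a (time, freq, 3) "spectral
# image" of the kind an object detector such as YOLOv3-tiny could consume.
# Channel layout (log-magnitude, delta, delta-delta) is an assumption.
import math

def frame_signal(signal, frame_len=64, hop=32):
    """Split a 1-D signal into overlapping frames."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]

def dft_magnitude(frame):
    """Naive DFT magnitude for the positive-frequency bins."""
    n = len(frame)
    mags = []
    for k in range(n // 2):
        re = sum(frame[t] * math.cos(-2 * math.pi * k * t / n) for t in range(n))
        im = sum(frame[t] * math.sin(-2 * math.pi * k * t / n) for t in range(n))
        mags.append(math.hypot(re, im))
    return mags

def spectral_image(signal, frame_len=64, hop=32):
    """Return a nested list of shape (frames, freq_bins, 3):
    log-magnitude spectrogram plus first/second temporal differences."""
    spec = [dft_magnitude(f) for f in frame_signal(signal, frame_len, hop)]
    log_spec = [[math.log(m + 1e-9) for m in row] for row in spec]

    def delta(rows):
        # Centered temporal difference, clamped at the edges.
        return [[rows[min(i + 1, len(rows) - 1)][j] - rows[max(i - 1, 0)][j]
                 for j in range(len(rows[0]))] for i in range(len(rows))]

    d1 = delta(log_spec)
    d2 = delta(d1)
    return [[[log_spec[i][j], d1[i][j], d2[i][j]]
             for j in range(len(log_spec[0]))] for i in range(len(log_spec))]

# Example: a 440 Hz tone sampled at 8 kHz.
sig = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(2048)]
img = spectral_image(sig)
print(len(img), len(img[0]), len(img[0][0]))  # frames x freq-bins x channels
```

In the paper's pipeline, each such image would then be paired with a YOLO-style annotation (one bounding box per phoneme segment carrying its AF labels), but the exact annotation format is specified in the paper itself, not reproduced here.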