{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,24]],"date-time":"2026-04-24T00:39:13Z","timestamp":1776991153705,"version":"3.51.4"},"reference-count":44,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2024,1,15]],"date-time":"2024-01-15T00:00:00Z","timestamp":1705276800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Asian Low-Resour. Lang. Inf. Process."],"published-print":{"date-parts":[[2024,1,31]]},"abstract":"<jats:p>Advanced Neural Networks are widely used to recognize multi-modal conversational speech with significant improvements in accuracy automatically. Significantly, Convolutional Neural sheets have retreated cutting-edge performance in Automatic Voice Recognition (AVR) recently more appropriately in English; however, the Hindi language has not been explored and examined well on AVR systems. The work in this article has exposed a three-layered two-dimensional Sequential Convolutional neural architecture. The Sequential Conv2D is an end-to-end system that can instantaneously exploit speech signal spectral and temporal structures. The network has been trained and tested on different cepstral features such as Frequency and Time variant-Mel-Filters, Gamma-tone Filter Cepstral Quantities, Bark-Filter band Coefficients, and Spectrogram features of speech structures. The experiment was performed on two low-resourced speech command datasets; Hindi with 27,145 Speech Keywords developed by TIFR and 23,664 (1-s utterances) of English speech commands by Google TensorFlow and AIY English Speech Commands. The experimental outcome showed that the model achieves significant performance of Convolutional layers trained on spectrograms with 91.60% accuracy, compared to that achieved in other cepstral feature labels for English speech. However, the model achieved an accuracy of 69.65% for Hindi audio words in which bark-frequency cepstral coefficients features outperformed spectrogram features.<\/jats:p>","DOI":"10.1145\/3606019","type":"journal-article","created":{"date-parts":[[2023,7,24]],"date-time":"2023-07-24T13:01:50Z","timestamp":1690203710000},"page":"1-21","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":11,"title":["End-to-end Multi-modal Low-resourced Speech Keywords Recognition Using Sequential Conv2D Nets"],"prefix":"10.1145","volume":"23","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9299-1128","authenticated-orcid":false,"given":"Pooja","family":"Gambhir","sequence":"first","affiliation":[{"name":"Department of Information Technology, Indira Gandhi Delhi Technical University for Women, Kashmere Gate, New Delhi, India"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6926-9433","authenticated-orcid":false,"given":"Amita","family":"Dev","sequence":"additional","affiliation":[{"name":"Department of Information Technology, Indira Gandhi Delhi Technical University for Women, Kashmere Gate, New Delhi, India"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8214-2840","authenticated-orcid":false,"given":"Poonam","family":"Bansal","sequence":"additional","affiliation":[{"name":"Department of Information Technology, Indira Gandhi Delhi Technical University for Women, Kashmere Gate, New Delhi, India"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6117-3464","authenticated-orcid":false,"given":"Deepak Kumar","family":"Sharma","sequence":"additional","affiliation":[{"name":"Department of Information Technology, Indira Gandhi Delhi Technical University for Women, Kashmere Gate, New Delhi, India"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,1,15]]},"reference":[{"key":"e_1_3_1_2_2","volume-title":"Proceedings of the Conference on Artificial Intelligence and Speech Technology","author":"Gambhir P.","year":"2019","unstructured":"P. Gambhir. 2019. Review of Chatbot design and trends. In Proceedings of the Conference on Artificial Intelligence and Speech Technology."},{"key":"e_1_3_1_3_2","doi-asserted-by":"crossref","unstructured":"M. Chellapriyadharshini A. Toffy and V. Ramasubramanian. 2018. Semi-supervised and active-learning scenarios: Efficient acoustic model refinement for a low resource Indian language. Retrieved from https:\/\/arXiv:1810.06635","DOI":"10.21437\/Interspeech.2018-2486"},{"key":"e_1_3_1_4_2","author":"Shamsfard M.","unstructured":"M. Shamsfard. 2019. Challenges and opportunities in processing low resource languages: A study on Persian. In International Conference Language Technologies for All (LT4All).","journal-title":"International Conference Language Technologies for All (LT4All)"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.14569\/IJACSA.2020.0110455"},{"key":"e_1_3_1_6_2","first-page":"195","volume-title":"Proceedings of the Computer Society of India (CSI\u201915), Speech and Language Processing for Human-Machine Communications, Advances in Intelligent Systems and Computing","author":"Bansal Poonam","year":"2015","unstructured":"Poonam Bansal et al. 2015. The State-of-the-art of feature extraction techniques: An overview. In Proceedings of the Computer Society of India (CSI\u201915), Speech and Language Processing for Human-Machine Communications, Advances in Intelligent Systems and Computing. Springer, 195\u2013207."},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2012.6288979"},{"key":"e_1_3_1_8_2","first-page":"1","article-title":"Hindi speech recognition using time delay neural network acoustic modeling with i-vector adaptation","author":"Kumar A.","year":"2020","unstructured":"A. Kumar and R. K. Aggarwal. 2020. Hindi speech recognition using time delay neural network acoustic modeling with i-vector adaptation. Int. J. Speech. Technol. (2020), 1\u201312.","journal-title":"Int. J. Speech. Technol."},{"key":"e_1_3_1_9_2","volume-title":"Handbook of Medical Image Computing and Computer Assisted Intervention","author":"Zhou S. K.","year":"2019","unstructured":"S. K. Zhou, D. Rueckert, and G. Fichtinger (Eds.). 2019. Handbook of Medical Image Computing and Computer Assisted Intervention. Academic Press."},{"issue":"1","key":"e_1_3_1_10_2","first-page":"1261","article-title":"A hybrid of deep CNN and bidirectional LSTM for automatic speech recognition","volume":"29","author":"Passricha V.","year":"2019","unstructured":"V. Passricha and R. K. Aggarwal. 2019. A hybrid of deep CNN and bidirectional LSTM for automatic speech recognition. J. Intell. Syst. 29, 1 (2019), 1261\u20131274.","journal-title":"J. Intell. Syst."},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/SLT.2018.8639629"},{"key":"e_1_3_1_12_2","unstructured":"W. Chan N. Jaitly Q. V. Le and O. Vinyals. 2015. Listen attend and spell. Retrieved from https:\/\/arXiv:1508.01211"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/ASRU.2017.8268937"},{"key":"e_1_3_1_14_2","unstructured":"R. Collobert C. Puhrsch and G. Synnaeve. 2016. Wav2letter: An end-to-end convnet-based speech recognition system. Retrieved from https:\/\/arXiv:1609.03193"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2014.2339736"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/SIPROCESS.2016.7888355"},{"key":"e_1_3_1_17_2","unstructured":"P. Jansson. 2018. Single-word speech recognition with convolutional neural networks on raw waveforms. https:\/\/scholar.google.com\/scholar?hl=en&as_sdt=0%2C5&q=P.+Jansson.+2018.+Single-word+speech+recognition+with+convolutional+neural+networks+on+raw+waveforms.&btnG="},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1186\/s13195-021-00888-3"},{"issue":"4","key":"e_1_3_1_19_2","article-title":"Speech recognition using convolutional neural networks","volume":"7","author":"Nagajyothi D.","year":"2018","unstructured":"D. Nagajyothi and P. Siddaiah. 2018. Speech recognition using convolutional neural networks. Int. J. Eng. Technol. 7, 4.6 (2018).","journal-title":"Int. J. Eng. Technol."},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/SIPROCESS.2016.7888355"},{"key":"e_1_3_1_21_2","doi-asserted-by":"crossref","unstructured":"Y. Zhang M. Pezeshki P. Brakel S. Zhang C. L. Y. Bengio and A. Courville. 2017. Towards end-to-end speech recognition with deep convolutional neural networks. arXiv preprint arXiv:1701.02720.","DOI":"10.21437\/Interspeech.2016-1446"},{"key":"e_1_3_1_22_2","author":"Li Xuejiao","unstructured":"Xuejiao Li and Zixuan Zhou. 2017. Speech command recognition with convolutional neural network. CS229 Stanford Education, Vol. 31.","journal-title":"CS229 Stanford Education"},{"key":"e_1_3_1_23_2","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'15)","author":"Huang J. T.","unstructured":"J. T. Huang, J. Li, and Y. Gong. 2015. An analysis of convolutional neural networks for speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'15). IEEE, 4989--4993."},{"issue":"1","key":"e_1_3_1_24_2","first-page":"55","article-title":"Normalized autocorrelation-based features for robust speech recognition in context with noisy environment","volume":"6","author":"Bansal Poonam","year":"2011","unstructured":"Poonam Bansal, Amita Dev, and Shail Bala Jain. 2011. Normalized autocorrelation-based features for robust speech recognition in context with noisy environment. J. Inf. Comput. Sci. 6, 1 (2011), 55\u201363.","journal-title":"J. Inf. Comput. Sci."},{"issue":"8","key":"e_1_3_1_25_2","first-page":"36","article-title":"Robust features for noisy speech recognition using MFCC computation from magnitude spectrum of higher order autocorrelation coefficients","volume":"10","author":"Bansal Poonam","year":"2010","unstructured":"Poonam Bansal, Amita Dev, and Shail Bala Jain. 2010. Robust features for noisy speech recognition using MFCC computation from magnitude spectrum of higher order autocorrelation coefficients. Int. J. Comput. Appl. 10, 8 (2010), 36\u201338.","journal-title":"Int. J. Comput. Appl."},{"key":"e_1_3_1_26_2","doi-asserted-by":"crossref","unstructured":"Pavel Golik Zolt\u00e1n T\u00fcske Ralf Schl\u00fcter and Hermann Ney. 2015. Convolutional neural networks for acoustic modeling of raw time signal in LVCSR. Retrieved from https:\/\/www.isca-speech.org\/archive\/interspeech_2015\/golik15_interspeech.html","DOI":"10.21437\/Interspeech.2015-6"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2014.2339736"},{"key":"e_1_3_1_28_2","author":"Bhable S.","unstructured":"S. Bhable, A. Lahase, and S. Maher. 2021. Automatic speech recognition (ASR) of isolated words in Hindi low resource language. Int. J. Res. Appl. Sci. Eng. Technol. 9, 2 (2021), 260--265.","journal-title":"Int. J. Res. Appl. Sci. Eng. Technol."},{"key":"e_1_3_1_29_2","unstructured":"A. G. Howard M. Zhu B. Chen D. Kalenichenko W. Wang T. Weyand and H. Adam. 2017. Mobilenets: Efficient convolutional neural networks for mobile vision applications. Retrieved from https:\/\/arXiv:1704.04861"},{"issue":"3","key":"e_1_3_1_30_2","first-page":"1","article-title":"Deep learning convolutional neural network for speech recognition: A review","volume":"5","author":"Taher K. I.","year":"2021","unstructured":"K. I. Taher and A. M. Abdulazeez. 2021. Deep learning convolutional neural network for speech recognition: A review. Int. J. Sci. Bus. 5, 3 (2021), 1\u201314.","journal-title":"Int. J. Sci. Bus."},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASL.2011.2134090"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2016.7472684"},{"key":"e_1_3_1_33_2","article-title":"Speech command recognition with convolutional neural network","author":"Li X.","year":"2017","unstructured":"X. Li and Z. Zhou. 2017. Speech command recognition with convolutional neural network. In CS229 Stanford Education, Vol. 31. Retrieved from http:\/\/cs229.stanford.edu\/proj2017\/final-reports\/5244201.pdf","journal-title":"CS229 Stanford Education"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.specom.2019.01.004"},{"key":"e_1_3_1_35_2","volume-title":"Speech Commands: A public Dataset for Single-Word Speech Recognition","author":"Warden P.","year":"2017","unstructured":"P. Warden. 2017. Speech Commands: A public Dataset for Single-Word Speech Recognition. Tillg\u00e4nglig. Retrieved from http:\/\/download.tensorflow.org\/data\/speech_commands_v0.01.tar.gz"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.5120\/7184-9893"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/LSP.2014.2325781"},{"issue":"1","key":"e_1_3_1_38_2","article-title":"MFCC and prosodic feature extraction techniques: A comparative study","volume":"54","author":"Singh N.","year":"2012","unstructured":"N. Singh, R. A. Khan, and R. Shree. 2012. MFCC and prosodic feature extraction techniques: A comparative study. Int. J. Comput. Appl. 54, 1 (2012).","journal-title":"Int. J. Comput. Appl."},{"key":"e_1_3_1_39_2","author":"Dhingra S. D.","unstructured":"S. D. Dhingra, G. Nijhawan, and P. Pandit. 2013. Isolated speech recognition using MFCC and DTW. International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering 2, 8 (2013), 4085--4092.","journal-title":"International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering"},{"key":"e_1_3_1_40_2","volume-title":"Proceedings of the Speech-Group Meeting of the Institute of Acoustics on Auditory Modelling","volume":"54","author":"Patterson R.","year":"1996","unstructured":"R. Patterson and I. N. Smith. 1996. An efficient auditory filter bank based on the gamma-tone-function. In Proceedings of the Speech-Group Meeting of the Institute of Acoustics on Auditory Modelling, vol. 54."},{"key":"e_1_3_1_41_2","volume-title":"Proceedings of the IOC Speech Group on Auditory Modelling at RSRE","author":"Patterson R. D","year":"1987","unstructured":"R. D Patterson, I. Nimmo-Smith, J. Holdsworth, and P. Rice. 1987. An efficient auditory filter bank based on Gammatone function. In Proceedings of the IOC Speech Group on Auditory Modelling at RSRE."},{"key":"e_1_3_1_42_2","doi-asserted-by":"crossref","unstructured":"M. Jeevan A. Dhingra M. Hanmandlu and B. K. Panigrahi. 2017. Robust speaker verification using GFCC based i-vectors. In Proceedings of the International Conference on Signal Networks Computing and Systems (ICSNCS'16) Vol. 1 Springer 85--91.","DOI":"10.1007\/978-81-322-3592-7_9"},{"issue":"2","key":"e_1_3_1_43_2","first-page":"14","article-title":"A robust BFCC feature extraction for ASR system","volume":"5","author":"Kuan T. W.","year":"2016","unstructured":"T. W. Kuan, A. C. Tsai, P. H. Sung, J. F. Wang, and H. S. Kuo. 2016. A robust BFCC feature extraction for ASR system. Artif. Intell. Res. 5, 2 (2016), 14\u201323.","journal-title":"Artif. Intell. Res."},{"issue":"12","key":"e_1_3_1_44_2","first-page":"22","article-title":"Comparative analysis of LPCC, MFCC and BFCC for the recognition of Hindi words using artificial neural networks","volume":"101","author":"Gulzar T.","year":"2014","unstructured":"T. Gulzar, A. Singh, and S. Sharma. 2014. Comparative analysis of LPCC, MFCC and BFCC for the recognition of Hindi words using artificial neural networks. Int. J. Comput. Appl. 101, 12 (2014), 22\u201327.","journal-title":"Int. J. Comput. Appl."},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2014.2339736"}],"container-title":["ACM Transactions on Asian and Low-Resource Language Information Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3606019","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3606019","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:36:20Z","timestamp":1750178180000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3606019"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,1,15]]},"references-count":44,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2024,1,31]]}},"alternative-id":["10.1145\/3606019"],"URL":"https:\/\/doi.org\/10.1145\/3606019","relation":{},"ISSN":["2375-4699","2375-4702"],"issn-type":[{"value":"2375-4699","type":"print"},{"value":"2375-4702","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,1,15]]},"assertion":[{"value":"2022-09-21","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-05-21","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-01-15","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}