{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,1]],"date-time":"2026-05-01T17:29:03Z","timestamp":1777656543645,"version":"3.51.4"},"reference-count":83,"publisher":"MDPI AG","issue":"10","license":[{"start":{"date-parts":[[2021,5,14]],"date-time":"2021-05-14T00:00:00Z","timestamp":1620950400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Audio signal classification finds various applications in detecting and monitoring health conditions in healthcare. Convolutional neural networks (CNN) have produced state-of-the-art results in image classification and are being increasingly used in other tasks, including signal classification. However, audio signal classification using CNN presents various challenges. In image classification tasks, raw images of equal dimensions can be used as a direct input to CNN. Raw time-domain signals, on the other hand, can be of varying dimensions. In addition, the temporal signal often has to be transformed to frequency-domain to reveal unique spectral characteristics, therefore requiring signal transformation. In this work, we overview and benchmark various audio signal representation techniques for classification using CNN, including approaches that deal with signals of different lengths and combine multiple representations to improve the classification accuracy. 
Hence, this work surfaces important empirical evidence that may guide future works deploying CNN for audio signal classification purposes.<\/jats:p>","DOI":"10.3390\/s21103434","type":"journal-article","created":{"date-parts":[[2021,5,17]],"date-time":"2021-05-17T02:31:34Z","timestamp":1621218694000},"page":"3434","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":33,"title":["Benchmarking Audio Signal Representation Techniques for Classification with Convolutional Neural Networks"],"prefix":"10.3390","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-1079-8709","authenticated-orcid":false,"given":"Roneel V.","family":"Sharan","sequence":"first","affiliation":[{"name":"Australian Institute of Health Innovation, Macquarie University, Sydney, NSW 2109, Australia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Hao","family":"Xiong","sequence":"additional","affiliation":[{"name":"Australian Institute of Health Innovation, Macquarie University, Sydney, NSW 2109, Australia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shlomo","family":"Berkovsky","sequence":"additional","affiliation":[{"name":"Australian Institute of Health Innovation, Macquarie University, Sydney, NSW 2109, Australia"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2021,5,14]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"81","DOI":"10.1145\/3214284","article-title":"A weakly supervised learning framework for detecting social anxiety and depression","volume":"2","author":"Salekin","year":"2018","journal-title":"Proc. ACM Interact. Mob. Wearable Ubiquitous Technol."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"665","DOI":"10.1016\/0967-0661(95)00042-S","article-title":"Electronic control of a wheelchair guided by voice commands","volume":"3","author":"Mazo","year":"1995","journal-title":"Control. 
Eng. Pract."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Bonet-Sol\u00e0, D., and Alsina-Pag\u00e8s, R.M. (2021). A comparative survey of feature extraction and machine learning methods in diverse acoustic environments. Sensors, 21.","DOI":"10.3390\/s21041274"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"485","DOI":"10.1109\/TBME.2018.2849502","article-title":"Automatic croup diagnosis using cough sound recognition","volume":"66","author":"Sharan","year":"2019","journal-title":"IEEE Trans. Biomed. Eng."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Ciregan, D., Meier, U., and Schmidhuber, J. (2012, January 16\u201321). Multi-column deep neural networks for image classification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA.","DOI":"10.1109\/CVPR.2012.6248110"},{"key":"ref_6","unstructured":"Krizhevsky, A., Sutskever, I., and Hinton, G.E. (2012, January 3\u20138). Imagenet classification with deep convolutional neural networks. Proceedings of the Advances in Neural Information Processing Systems (NIPS), Lake Tahoe, NV, USA."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"279","DOI":"10.1109\/LSP.2017.2657381","article-title":"Deep convolutional neural networks and data augmentation for environmental sound classification","volume":"24","author":"Salamon","year":"2017","journal-title":"IEEE Signal Process. Lett."},{"key":"ref_8","unstructured":"Soria, O.E., Mart\u00edn, G.J.D., Marcelino, M.-S., Rafael, M.-B.J., and Serrano, L.A.J. (2010). Transfer learning. Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, IGI Global."},{"key":"ref_9","first-page":"1929","article-title":"Dropout: A simple way to prevent neural networks from overfitting","volume":"15","author":"Srivastava","year":"2014","journal-title":"J. Mach. Learn. 
Res."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"300","DOI":"10.1016\/j.matdes.2018.11.060","article-title":"Using deep neural network with small dataset to predict material defects","volume":"162","author":"Feng","year":"2019","journal-title":"Mater. Des."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Liu, S., and Deng, W. (2015, January 3\u20136). Very deep convolutional neural network based image classification using small training sample size. Proceedings of the 3rd IAPR Asian Conference on Pattern Recognition (ACPR), Kuala Lumpur, Malaysia.","DOI":"10.1109\/ACPR.2015.7486599"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Ng, H.-W., Nguyen, V.D., Vonikakis, V., and Winkler, S. (2015, January 9\u201313). Deep learning for emotion recognition on small datasets using transfer learning. Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, Seattle, WA, USA.","DOI":"10.1145\/2818346.2830593"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"321","DOI":"10.1016\/j.neucom.2018.09.013","article-title":"GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification","volume":"321","author":"Diamant","year":"2018","journal-title":"Neurocomputing"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"22","DOI":"10.1016\/j.neucom.2016.03.020","article-title":"An overview of applications and advancements in automatic sound recognition","volume":"200","author":"Sharan","year":"2016","journal-title":"Neurocomputing"},{"key":"ref_15","unstructured":"Gerhard, D. (2003). Audio Signal Classification: History and Current Techniques, University of Regina. TR-CS 2003-07."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"1733","DOI":"10.1109\/TMM.2015.2428998","article-title":"Detection and classification of acoustic scenes and events","volume":"17","author":"Stowell","year":"2015","journal-title":"IEEE Trans. 
Multimed."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Sainath, T.N., Mohamed, A., Kingsbury, B., and Ramabhadran, B. (2013, January 26\u201331). Deep convolutional neural networks for LVCSR. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada.","DOI":"10.1109\/ICASSP.2013.6639347"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"1533","DOI":"10.1109\/TASLP.2014.2339736","article-title":"Convolutional neural networks for speech recognition","volume":"22","author":"Mohamed","year":"2014","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"1120","DOI":"10.1109\/LSP.2014.2325781","article-title":"Convolutional neural networks for distant speech recognition","volume":"21","author":"Swietojanski","year":"2014","journal-title":"IEEE Signal Process. Lett."},{"key":"ref_20","unstructured":"Hertel, L., Phan, H., and Mertins, A. (2016). Classifying variable-length audio files with all-convolutional networks and masked global pooling. arXiv."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Kumar, A., and Raj, B. (2017). Deep CNN framework for audio event recognition using weakly labeled web data. arXiv.","DOI":"10.1145\/2964284.2964310"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Hertel, L., Phan, H., and Mertins, A. (2016, January 24\u201329). Comparing time and frequency domain for audio event recognition using deep learning. Proceedings of the International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada.","DOI":"10.1109\/IJCNN.2016.7727635"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Golik, P., T\u00fcske, Z., Schl\u00fcter, R., and Ney, H. (2015, January 6\u201310). Convolutional neural networks for acoustic modeling of raw time signal in LVCSR. 
Proceedings of the 16th Annual Conference of the International Speech Communication Association (INTERSPEECH), Dresden, Germany.","DOI":"10.21437\/Interspeech.2015-6"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Sharan, R.V., Berkovsky, S., and Liu, S. (2020, January 20\u201324). Voice command recognition using biologically inspired time-frequency representation and convolutional neural networks. Proceedings of the 42nd Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Montreal, QC, Canada.","DOI":"10.1109\/EMBC44109.2020.9176006"},{"key":"ref_25","unstructured":"Becker, S., Ackermann, M., Lapuschkin, S., M\u00fcller, K.-R., and Samek, W. (2018). Interpreting and explaining deep neural networks for classification of audio signals. arXiv."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"917","DOI":"10.1007\/s10618-019-00619-1","article-title":"Deep learning for time series classification: A review","volume":"33","author":"Forestier","year":"2019","journal-title":"Data Min. Knowl. Discov."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"1034","DOI":"10.1109\/TSP.2018.2887403","article-title":"Convolutional neural network architectures for signals supported on graphs","volume":"67","author":"Gama","year":"2019","journal-title":"IEEE Trans. Signal Process."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"436","DOI":"10.1038\/nature14539","article-title":"Deep learning","volume":"521","author":"LeCun","year":"2015","journal-title":"Nature"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"1875","DOI":"10.1109\/TASLP.2020.2964959","article-title":"Sound events recognition and retrieval using multi-convolutional-channel sparse coding convolutional neural networks","volume":"28","author":"Wang","year":"2020","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. 
Process."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Hershey, S., Chaudhuri, S., Ellis, D.P.W., Gemmeke, J.F., Jansen, A., Moore, R.C., Plakal, M., Platt, D., Saurous, R.A., and Seybold, B. (2017, January 5\u20139). CNN architectures for large-scale audio classification. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA.","DOI":"10.1109\/ICASSP.2017.7952132"},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"235","DOI":"10.1109\/TASSP.1977.1162950","article-title":"Short term spectral analysis, synthesis, and modification by discrete Fourier transform","volume":"25","author":"Allen","year":"1977","journal-title":"IEEE Trans. Acoust. Speech Signal Process."},{"key":"ref_32","unstructured":"Allen, J. (1982, January 3\u20135). Applications of the short time Fourier transform to speech processing and spectral analysis. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Paris, France."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Badshah, A.M., Ahmad, J., Rahim, N., and Baik, S.W. (2017, January 13\u201315). Speech emotion recognition from spectrograms with deep convolutional neural network. Proceedings of the International Conference on Platform Technology and Service (PlatCon), Busan, Korea.","DOI":"10.1109\/PlatCon.2017.7883728"},{"key":"ref_34","unstructured":"Brown, R.G. (2004). Smoothing, Forecasting and Prediction of Discrete Time Series, Dover Publications."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"44","DOI":"10.1016\/j.patrec.2017.09.023","article-title":"Increasing the robustness of CNN acoustic models using autoregressive moving average spectrogram features and channel dropout","volume":"100","author":"Ganapathy","year":"2017","journal-title":"Pattern Recognit. 
Lett."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"62","DOI":"10.1016\/j.apacoust.2018.12.006","article-title":"Acoustic event recognition using cochleagram image and convolutional neural networks","volume":"148","author":"Sharan","year":"2019","journal-title":"Appl. Acoust."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"357","DOI":"10.1109\/TASSP.1980.1163420","article-title":"Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences","volume":"28","author":"Davis","year":"1980","journal-title":"IEEE Trans. Acoust. Speech Signal Process."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"185","DOI":"10.1121\/1.1915893","article-title":"A scale for the measurement of the psychological magnitude pitch","volume":"8","author":"Stevens","year":"1937","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"379","DOI":"10.1109\/TASLP.2017.2778423","article-title":"Detection and classification of acoustic scenes and events: Outcome of the DCASE 2016 challenge","volume":"26","author":"Mesaros","year":"2018","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_40","unstructured":"Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X., Moore, G., Odell, J., Ollason, D., and Povey, D. (2009). The HTK Book (for HTK Version 3.4), Cambridge University Engineering Department."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Sharan, R.V., and Moir, T.J. (2019, January 16\u201318). Time-frequency image resizing using interpolation for acoustic event recognition with convolutional neural networks. Proceedings of the IEEE International Conference on Signals and Systems (ICSigSys), Bandung, Indonesia.","DOI":"10.1109\/ICSIGSYS.2019.8811088"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Tjandra, A., Sakti, S., Neubig, G., Toda, T., Adriani, M., and Nakamura, S. (2015, January 19\u201324). 
Combination of two-dimensional cochleogram and spectrogram features for deep learning-based ASR. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia.","DOI":"10.1109\/ICASSP.2015.7178827"},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"425","DOI":"10.1121\/1.400476","article-title":"Calculation of a constant Q spectral transform","volume":"89","author":"Brown","year":"1991","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_44","first-page":"142","article-title":"Histogram of gradients of time-frequency representations for audio scene classification","volume":"23","author":"Rakotomamonjy","year":"2015","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"1672","DOI":"10.1007\/s00034-019-01203-0","article-title":"Time-frequency feature fusion for noise robust audio event classification","volume":"39","author":"McLoughlin","year":"2020","journal-title":"Circuits Syst. Signal Process."},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"505","DOI":"10.1016\/j.neucom.2017.07.021","article-title":"Noise robust sound event classification with convolutional neural network","volume":"272","author":"Ozer","year":"2018","journal-title":"Neurocomputing"},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"98","DOI":"10.1186\/s40537-019-0263-7","article-title":"Enlarging smaller images before inputting into convolutional neural network: Zero-padding vs. interpolation","volume":"6","author":"Hashemi","year":"2019","journal-title":"J. Big Data"},{"key":"ref_48","unstructured":"Reddy, D.M., and Reddy, N.V.S. (2019). Effects of padding on LSTMs and CNNs. arXiv."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Tang, H., Ortis, A., and Battiato, S. (2019, January 9\u201313). The impact of padding on image classification by using pre-trained convolutional neural networks. 
Proceedings of the 20th International Conference on Image Analysis and Processing (ICIAP), Trento, Italy.","DOI":"10.1007\/978-3-030-30645-8_31"},{"key":"ref_50","doi-asserted-by":"crossref","first-page":"3397","DOI":"10.1109\/78.258082","article-title":"Matching pursuits with time-frequency dictionaries","volume":"41","author":"Mallat","year":"1993","journal-title":"IEEE Trans. Signal Process."},{"key":"ref_51","doi-asserted-by":"crossref","first-page":"209","DOI":"10.1109\/TNN.2002.806626","article-title":"Content-based audio classification and retrieval by support vector machines","volume":"14","author":"Guo","year":"2003","journal-title":"IEEE Trans. Neural Netw."},{"key":"ref_52","doi-asserted-by":"crossref","first-page":"763","DOI":"10.1109\/TIFS.2008.2008216","article-title":"Using one-class SVMs and wavelets for audio surveillance","volume":"3","author":"Rabaoui","year":"2008","journal-title":"IEEE Trans. Inf. Forensics Secur."},{"key":"ref_53","doi-asserted-by":"crossref","first-page":"1142","DOI":"10.1109\/TASL.2009.2017438","article-title":"Environmental sound recognition with time-frequency audio features","volume":"17","author":"Chu","year":"2009","journal-title":"IEEE Trans. Audio Speech Lang. Process."},{"key":"ref_54","doi-asserted-by":"crossref","first-page":"2605","DOI":"10.1109\/TIFS.2015.2469254","article-title":"Subband time-frequency image texture features for robust audio surveillance","volume":"10","author":"Sharan","year":"2015","journal-title":"IEEE Trans. Inf. Forensics Secur."},{"key":"ref_55","doi-asserted-by":"crossref","unstructured":"Li, S., Yao, Y., Hu, J., Liu, G., Yao, X., and Hu, J. (2018). An ensemble stacked convolutional neural network model for environmental event sound recognition. Appl. Sci., 8.","DOI":"10.3390\/app8071152"},{"key":"ref_56","doi-asserted-by":"crossref","unstructured":"Su, Y., Zhang, K., Wang, J., and Madani, K. (2019). Environment sound classification using a two-stream CNN based on decision-level fusion. 
Sensors, 19.","DOI":"10.3390\/s19071733"},{"key":"ref_57","doi-asserted-by":"crossref","unstructured":"Mesaros, A., Heittola, T., and Virtanen, T. (2018, January 17\u201320). Acoustic scene classification: An overview of DCASE 2017 Challenge entries. Proceedings of the 16th International Workshop on Acoustic Signal Enhancement (IWAENC), Tokyo, Japan.","DOI":"10.1109\/IWAENC.2018.8521242"},{"key":"ref_58","doi-asserted-by":"crossref","first-page":"2887","DOI":"10.1007\/s11042-020-08836-3","article-title":"Deep learning-based late fusion of multimodal information for emotion classification of music video","volume":"80","author":"Pandeya","year":"2021","journal-title":"Multimed. Tools Appl."},{"key":"ref_59","unstructured":"Wang, H., Zou, Y., and Chong, D. (2020, January 2\u20134). Acoustic scene classification with spectrogram processing strategies. Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop (DCASE2020), Tokyo, Japan."},{"key":"ref_60","doi-asserted-by":"crossref","unstructured":"Sharan, R.V. (2020, January 3\u20135). Spoken digit recognition using wavelet scalogram and convolutional neural networks. Proceedings of the IEEE Recent Advances in Intelligent Computational Systems (RAICS), Thiruvananthapuram, India.","DOI":"10.1109\/RAICS51191.2020.9332505"},{"key":"ref_61","unstructured":"Cazals, Y., Horner, K., and Demany, L. (1992). Complex sounds and auditory images. Auditory Physiology and Perception, Pergamon."},{"key":"ref_62","doi-asserted-by":"crossref","first-page":"103","DOI":"10.1016\/0378-5955(90)90170-T","article-title":"Derivation of auditory filter shapes from notched-noise data","volume":"47","author":"Glasberg","year":"1990","journal-title":"Hear. Res."},{"key":"ref_63","unstructured":"Slaney, M. (1988). 
Lyon\u2019s Cochlear Model, Apple Computer."},{"key":"ref_64","doi-asserted-by":"crossref","first-page":"2592","DOI":"10.1121\/1.399052","article-title":"A cochlear frequency-position function for several species-29 years later","volume":"87","author":"Greenwood","year":"1990","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_65","unstructured":"Slaney, M. (1993). An Efficient Implementation of the Patterson-Holdsworth Auditory Filter Bank, Apple Computer, Inc."},{"key":"ref_66","unstructured":"Slaney, M. (1998). Auditory Toolbox for Matlab, Interval Research Corporation."},{"key":"ref_67","doi-asserted-by":"crossref","first-page":"20","DOI":"10.1016\/j.bspc.2016.10.004","article-title":"Heart sound classification based on scaled spectrogram and partial least squares regression","volume":"32","author":"Zhang","year":"2017","journal-title":"Biomed. Signal Process. Control"},{"key":"ref_68","first-page":"5067651","article-title":"Deep learning enabled fault diagnosis using time-frequency image analysis of rolling element bearings","volume":"2017","author":"Verstraete","year":"2017","journal-title":"Shock Vib."},{"key":"ref_69","doi-asserted-by":"crossref","unstructured":"Stallmann, C.F., and Engelbrecht, A.P. (2016, January 20\u201322). Signal modelling for the digital reconstruction of gramophone noise. Proceedings of the International Conference on E-Business and Telecommunications (ICETE) 2015, Colmar, France.","DOI":"10.1007\/978-3-319-30222-5_19"},{"key":"ref_70","doi-asserted-by":"crossref","first-page":"1153","DOI":"10.1109\/TASSP.1981.1163711","article-title":"Cubic convolution interpolation for digital image processing","volume":"29","author":"Keys","year":"1981","journal-title":"IEEE Trans. Acoust. Speech Signal Process."},{"key":"ref_71","first-page":"90","article-title":"The function sin x\/x","volume":"21","author":"Gearhart","year":"1990","journal-title":"Coll. Math. J."},{"key":"ref_72","unstructured":"Glassner, A.S. (1990). 
Filters for common resampling tasks. Graphics Gems, Morgan Kaufmann."},{"key":"ref_73","unstructured":"Nakamura, S., Hiyane, K., Asano, F., Nishiura, T., and Yamada, T. (June, January 31). Acoustical sound database in real environments for sound scene understanding and hands-free speech recognition. Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC 2000), Athens, Greece."},{"key":"ref_74","unstructured":"Warden, P. (2018). Speech commands: A dataset for limited-vocabulary speech recognition. arXiv."},{"key":"ref_75","unstructured":"Kingma, D.P., and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv."},{"key":"ref_76","unstructured":"Bishop, C.M. (2006). Pattern Recognition and Machine Learning, Springer."},{"key":"ref_77","unstructured":"Ioffe, S., and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv."},{"key":"ref_78","unstructured":"Nair, V., and Hinton, G.E. (2010, January 21\u201324). Rectified linear units improve restricted boltzmann machines. Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel."},{"key":"ref_79","unstructured":"Jarrett, K., Kavukcuoglu, K., Ranzato, M.A., and LeCun, Y. (October, January 29). What is the best multi-stage architecture for object recognition?. Proceedings of the IEEE 12th International Conference on Computer Vision, Kyoto, Japan."},{"key":"ref_80","doi-asserted-by":"crossref","first-page":"238","DOI":"10.5201\/ipol.2011.g_lmii","article-title":"Linear methods for image interpolation","volume":"1","author":"Getreuer","year":"2011","journal-title":"Image Process. Line"},{"key":"ref_81","doi-asserted-by":"crossref","first-page":"285","DOI":"10.1109\/JSTSP.2019.2909479","article-title":"Comparison and analysis of SampleCNN architectures for audio classification","volume":"13","author":"Kim","year":"2019","journal-title":"IEEE J. Sel. Top. 
Signal Process."},{"key":"ref_82","doi-asserted-by":"crossref","unstructured":"Sainath, T.N., Weiss, R.J., Senior, A., Wilson, K.W., and Vinyals, O. (2015, January 6\u201310). Learning the speech front-end with raw waveform CLDNNs. Proceedings of the 16th Annual Conference of the International Speech Communication Association (INTERSPEECH), Dresden, Germany.","DOI":"10.21437\/Interspeech.2015-1"},{"key":"ref_83","doi-asserted-by":"crossref","first-page":"411","DOI":"10.1109\/LSP.2020.2975422","article-title":"CNN-based learnable gammatone filterbank and equal-loudness normalization for environmental sound classification","volume":"27","author":"Park","year":"2020","journal-title":"IEEE Signal Process. Lett."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/10\/3434\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T06:01:00Z","timestamp":1760162460000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/10\/3434"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,5,14]]},"references-count":83,"journal-issue":{"issue":"10","published-online":{"date-parts":[[2021,5]]}},"alternative-id":["s21103434"],"URL":"https:\/\/doi.org\/10.3390\/s21103434","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,5,14]]}}}