{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,5]],"date-time":"2026-03-05T16:08:01Z","timestamp":1772726881797,"version":"3.50.1"},"reference-count":68,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2020,3,5]],"date-time":"2020-03-05T00:00:00Z","timestamp":1583366400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2020,3,5]],"date-time":"2020-03-05T00:00:00Z","timestamp":1583366400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J AUDIO SPEECH MUSIC PROC."],"published-print":{"date-parts":[[2020,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>This paper presents a new approach based on recurrent neural networks (RNN) to the multiclass audio segmentation task whose goal is to classify an audio signal as speech, music, noise or a combination of these. The proposed system is based on the use of bidirectional long short-term Memory (BLSTM) networks to model temporal dependencies in the signal. The RNN is complemented by a resegmentation module, gaining long term stability by means of the tied state concept in hidden Markov models. We explore different neural architectures introducing temporal pooling layers to reduce the neural network output sampling rate. Our findings show that removing redundant temporal information is beneficial for the segmentation system showing a relative improvement close to 5%. Furthermore, this solution does not increase the number of parameters of the model and reduces the number of operations per second, allowing our system to achieve a real-time factor below 0.04 if running on CPU and below 0.03 if running on GPU. 
This new architecture combined with a data-agnostic data augmentation technique called mixup allows our system to achieve competitive results in both the Albayz\u00edn 2010 and 2012 evaluation datasets, presenting a relative improvement of 19.72% and 5.35% compared to the best results found in the literature for these databases.<\/jats:p>","DOI":"10.1186\/s13636-020-00172-6","type":"journal-article","created":{"date-parts":[[2020,3,5]],"date-time":"2020-03-05T10:03:31Z","timestamp":1583402611000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":29,"title":["Multiclass audio segmentation based on recurrent neural networks for broadcast domain data"],"prefix":"10.1186","volume":"2020","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3142-0708","authenticated-orcid":false,"given":"Pablo","family":"Gimeno","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ignacio","family":"Vi\u00f1als","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Alfonso","family":"Ortega","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Antonio","family":"Miguel","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Eduardo","family":"Lleida","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2020,3,5]]},"reference":[{"issue":"11","key":"172_CR1","first-page":"1","volume":"6","author":"T. Theodorou","year":"2014","unstructured":"T. Theodorou, I. Mporas, N. Fakotakis, An overview of automatic audio segmentation. Int. J. Inf. Technol. Comput. Sci. (IJITCS). 6(11), 1\u20139 (2014).","journal-title":"Int. J. Inf. Technol. Comput. Sci. 
(IJITCS)"},{"issue":"1","key":"172_CR2","doi-asserted-by":"publisher","first-page":"716","DOI":"10.1016\/j.asoc.2009.12.033","volume":"11","author":"P. Dhanalakshmi","year":"2011","unstructured":"P. Dhanalakshmi, S. Palanivel, V. Ramalingam, Classification of audio signals using AANN and GMM. Appl. Soft Comput.11(1), 716\u2013723 (2011).","journal-title":"Appl. Soft Comput."},{"key":"172_CR3","unstructured":"M. R. Hasan, M. Jamil, M. Rahman, et al., in 3rd International Conference on Electrical & Computer Engineering (ICECE). Speaker Identification Using Mel Frequency Cepstral Coefficients, (2004), pp. 565\u2013568."},{"key":"172_CR4","doi-asserted-by":"publisher","unstructured":"E. Wong, S. Sridharan, in Proc. IEEE International Symposium on Intelligent Multimedia, Video and Speech Processing. Comparison of linear prediction cepstrum coefficients and mel-frequency cepstrum coefficients for language identification, (2001), pp. 95\u201398. https:\/\/doi.org\/10.1109\/isimp.2001.925340.","DOI":"10.1109\/isimp.2001.925340"},{"key":"172_CR5","doi-asserted-by":"publisher","unstructured":"H. -Y. Lo, J. -C. Wang, H. -M. Wang, in IEEE International Conference on Multimedia and Expo (ICME). Homogeneous segmentation and classifier ensemble for audio tag annotation and retrieval, (2010), pp. 304\u2013309. https:\/\/doi.org\/10.1109\/icme.2010.5583009.","DOI":"10.1109\/icme.2010.5583009"},{"issue":"7","key":"172_CR6","doi-asserted-by":"publisher","first-page":"659","DOI":"10.1109\/LSP.2010.2049877","volume":"17","author":"A. Gallardo-Antol\u00edn","year":"2010","unstructured":"A. Gallardo-Antol\u00edn, J. M. Montero, Histogram equalization-based features for speech, music, and song discrimination. IEEE Sig. Process. Lett.17(7), 659\u2013662 (2010).","journal-title":"IEEE Sig. Process. Lett."},{"issue":"3","key":"172_CR7","doi-asserted-by":"publisher","first-page":"907","DOI":"10.1109\/TSA.2005.858057","volume":"14","author":"R. Huang","year":"2006","unstructured":"R. Huang, J. 
H. Hansen, Advances in unsupervised audio classification and segmentation for the broadcast news and NGSW corpora. IEEE Trans. Audio Speech Lang. Process.14(3), 907\u2013919 (2006).","journal-title":"IEEE Trans. Audio Speech Lang. Process."},{"key":"172_CR8","doi-asserted-by":"publisher","unstructured":"J. Saunders, in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Real-time discrimination of broadcast speech\/music, (1996), pp. 993\u2013996. https:\/\/doi.org\/10.1109\/icassp.1996.543290.","DOI":"10.1109\/icassp.1996.543290"},{"issue":"1","key":"172_CR9","doi-asserted-by":"publisher","first-page":"266","DOI":"10.1109\/TSA.2005.852992","volume":"14","author":"C. -H. Wu","year":"2006","unstructured":"C. -H. Wu, Y. -H. Chiu, C. -J. Shia, C. -Y. Lin, Automatic segmentation and identification of mixed-language speech using delta-BIC and LSA-based GMMs. IEEE Trans. Audio Speech Lang. Process.14(1), 266\u2013276 (2006).","journal-title":"IEEE Trans. Audio Speech Lang. Process."},{"issue":"5","key":"172_CR10","doi-asserted-by":"publisher","first-page":"920","DOI":"10.1109\/TASL.2008.925152","volume":"16","author":"M. Kotti","year":"2008","unstructured":"M. Kotti, E. Benetos, C. Kotropoulos, Computationally efficient and robust BIC-based speaker segmentation. IEEE Trans. Audio Speech Lang. Process.16(5), 920\u2013933 (2008).","journal-title":"IEEE Trans. Audio Speech Lang. Process."},{"issue":"4","key":"172_CR11","doi-asserted-by":"publisher","first-page":"331","DOI":"10.1109\/LSP.2013.2247039","volume":"20","author":"A. Dessein","year":"2013","unstructured":"A. Dessein, A. Cont, An information-geometric approach to real-time audio segmentation. IEEE Sig. Process. Lett.20(4), 331\u2013334 (2013).","journal-title":"IEEE Sig. Process. Lett."},{"key":"172_CR12","doi-asserted-by":"publisher","unstructured":"J. Foote, in IEEE International Conference on Multimedia and Expo (ICME). 
Automatic audio segmentation using a measure of audio novelty, (2000), pp. 452\u2013455. https:\/\/doi.org\/10.1109\/icme.2000.869637.","DOI":"10.1109\/icme.2000.869637"},{"key":"172_CR13","doi-asserted-by":"publisher","unstructured":"R. Yin, H. Bredin, C. Barras, in Proc. Interspeech 2017. Speaker change detection in broadcast tv using bidirectional long short-term memory networks, (2017), pp. 3827\u20133831. https:\/\/doi.org\/10.21437\/interspeech.2017-65.","DOI":"10.21437\/interspeech.2017-65"},{"key":"172_CR14","unstructured":"A. Misra, in Proc. Interspeech. Speech\/nonspeech segmentation in web videos, (2012), pp. 1977\u20131980."},{"key":"172_CR15","doi-asserted-by":"publisher","unstructured":"G. Richard, M. Ramona, S. Essid, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Combined supervised and unsupervised approaches for automatic segmentation of radiophonic audio streams, (2007), pp. 461\u2013464. https:\/\/doi.org\/10.1109\/icassp.2007.366272.","DOI":"10.1109\/icassp.2007.366272"},{"key":"172_CR16","doi-asserted-by":"publisher","first-page":"2","DOI":"10.1155\/2009\/239892","volume":"2009","author":"Y. Lavner","year":"2009","unstructured":"Y. Lavner, D. Ruinskiy, A decision-tree-based algorithm for speech\/music classification and segmentation. EURASIP J. Audio Speech Music. Process.2009:, 2 (2009).","journal-title":"EURASIP J. Audio Speech Music. Process."},{"issue":"1","key":"172_CR17","doi-asserted-by":"publisher","first-page":"34","DOI":"10.1186\/s13636-014-0034-5","volume":"2014","author":"D. Cast\u00e1n","year":"2014","unstructured":"D. Cast\u00e1n, A. Ortega, A. Miguel, E. Lleida, Audio segmentation-by-classification approach based on factor analysis in broadcast news domain. EURASIP J. Audio Speech Music. Process.2014(1), 34 (2014).","journal-title":"EURASIP J. Audio Speech Music. 
Process."},{"issue":"3","key":"172_CR18","doi-asserted-by":"publisher","first-page":"351","DOI":"10.1016\/S0167-6393(02)00087-0","volume":"40","author":"J. Ajmera","year":"2003","unstructured":"J. Ajmera, I. McCowan, H. Bourlard, Speech\/music segmentation using entropy and dynamism features in a HMM classification framework. Speech Commun.40(3), 351\u2013363 (2003).","journal-title":"Speech Commun."},{"key":"172_CR19","doi-asserted-by":"publisher","unstructured":"L. Lu, H. Jiang, H. Zhang, in Proc. 9th ACM International Conference on Multimedia. A robust audio classification and segmentation method, (2001), pp. 203\u2013211. https:\/\/doi.org\/10.1145\/500141.500173.","DOI":"10.1145\/500141.500173"},{"issue":"6","key":"172_CR20","doi-asserted-by":"publisher","first-page":"82","DOI":"10.1109\/MSP.2012.2205597","volume":"29","author":"G. Hinton","year":"2012","unstructured":"G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. -r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al., Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Sig. Process. Mag.29(6), 82\u201397 (2012).","journal-title":"IEEE Sig. Process. Mag."},{"key":"172_CR21","doi-asserted-by":"publisher","unstructured":"L. Deng, G. Hinton, B. Kingsbury, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). New types of deep neural network learning for speech recognition and related applications: An overview, (2013), pp. 8599\u20138603. https:\/\/doi.org\/10.1109\/icassp.2013.6639344.","DOI":"10.1109\/icassp.2013.6639344"},{"key":"172_CR22","doi-asserted-by":"publisher","unstructured":"D. Snyder, D. Garcia-Romero, D. Povey, S. Khudanpur, in Proc. Interspeech. Deep neural network embeddings for text-independent speaker verification, (2017), pp. 999\u20131003. 
https:\/\/doi.org\/10.21437\/interspeech.2017-620.","DOI":"10.21437\/interspeech.2017-620"},{"key":"172_CR23","doi-asserted-by":"publisher","unstructured":"D. Garcia-Romero, D. Snyder, G. Sell, D. Povey, A. McCree, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Speaker diarization using deep neural network embeddings, (2017), pp. 4930\u20134934. https:\/\/doi.org\/10.1109\/icassp.2017.7953094.","DOI":"10.1109\/icassp.2017.7953094"},{"key":"172_CR24","doi-asserted-by":"publisher","unstructured":"X. Shao, C. Xu, M. S. Kankanhalli, in Proc. Joint 4th International Conference on Information, Communications and Signal Processing, and 4th Pacific Rim Conference on Multimedia. Applying neural network on the content-based audio classification, (2003), pp. 1821\u20131825. https:\/\/doi.org\/10.1109\/icics.2003.1292781.","DOI":"10.1109\/icics.2003.1292781"},{"key":"172_CR25","doi-asserted-by":"crossref","unstructured":"H. Meinedo, J. Neto, in 9th European Conference on Speech Communication and Technology. A stream-based audio segmentation, classification and clustering pre-processing system for broadcast news using ANN models, (2005).","DOI":"10.21437\/Interspeech.2005-117"},{"key":"172_CR26","doi-asserted-by":"publisher","first-page":"8","DOI":"10.1016\/j.dsp.2018.03.004","volume":"81","author":"Xu-Kui Yang","year":"2018","unstructured":"X. -K. Yang, D. Qu, W. -L. Zhang, W. -Q. Zhang, An adapted data selection for deep learning-based audio segmentation in multi-genre broadcast channel. Digit. Sig. Process. (2018). https:\/\/doi.org\/10.1016\/j.dsp.2018.03.004.","journal-title":"Digital Signal Processing"},{"key":"172_CR27","doi-asserted-by":"publisher","unstructured":"K. J. Piczak, in IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP). Environmental sound classification with convolutional neural networks, (2015), pp. 1\u20136. 
https:\/\/doi.org\/10.1109\/mlsp.2015.7324337.","DOI":"10.1109\/mlsp.2015.7324337"},{"key":"172_CR28","unstructured":"D. Doukhan, J. Carrive, in 9th International Conferences on Advances in Multimedia (MMEDIA). Investigating the use of semi-supervised convolutional neural network models for speech\/music classification and segmentation, (2017)."},{"issue":"1","key":"172_CR29","doi-asserted-by":"publisher","first-page":"11","DOI":"10.1186\/s13636-019-0155-y","volume":"2019","author":"B. -Y. Jang","year":"2019","unstructured":"B. -Y. Jang, W. -H. Heo, J. -H. Kim, O. -W. Kwon, Music detection from broadcast contents using convolutional neural networks with a mel-scale kernel. EURASIP J. Audio Speech Music. Process.2019(1), 11 (2019).","journal-title":"EURASIP J. Audio Speech Music. Process."},{"issue":"8","key":"172_CR30","doi-asserted-by":"publisher","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","volume":"9","author":"S. Hochreiter","year":"1997","unstructured":"S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Comput.9(8), 1735\u20131780 (1997).","journal-title":"Neural Comput."},{"key":"172_CR31","doi-asserted-by":"publisher","unstructured":"F. A. Gers, J. Schmidhuber, F. Cummins, in 9th International Conference on Artificial Neural Networks (ICANN). Learning to forget: continual prediction with LSTM, (1999), pp. 850\u2013855. https:\/\/doi.org\/10.1049\/cp:19991218.","DOI":"10.1049\/cp:19991218"},{"key":"172_CR32","doi-asserted-by":"publisher","unstructured":"A. Graves, A. -r. Mohamed, G. Hinton, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Speech recognition with deep recurrent neural networks, (2013), pp. 6645\u20136649. https:\/\/doi.org\/10.1109\/icassp.2013.6638947.","DOI":"10.1109\/icassp.2013.6638947"},{"key":"172_CR33","unstructured":"M. Sundermeyer, R. Schl\u00fcter, H. Ney, in Proc. Interspeech. LSTM neural networks for language modeling, (2012), pp. 
194\u2013197."},{"key":"172_CR34","doi-asserted-by":"publisher","unstructured":"G. Heigold, I. Moreno, S. Bengio, N. Shazeer, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). End-to-end text-dependent speaker verification, (2016), pp. 5115\u20135119. https:\/\/doi.org\/10.1109\/icassp.2016.7472652.","DOI":"10.1109\/icassp.2016.7472652"},{"key":"172_CR35","doi-asserted-by":"publisher","unstructured":"F. Eyben, F. Weninger, S. Squartini, B. Schuller, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Real-life voice activity detection with LSTM recurrent neural networks and an application to Hollywood movies, (2013), pp. 483\u2013487. https:\/\/doi.org\/10.1109\/icassp.2013.6637694.","DOI":"10.1109\/icassp.2013.6637694"},{"key":"172_CR36","doi-asserted-by":"publisher","unstructured":"J. Kim, J. Kim, S. Lee, J. Park, M. Hahn, in Proc. 8th International Conference on Signal Processing Systems. Vowel based voice activity detection with LSTM recurrent neural network, (2016), pp. 134\u2013137. https:\/\/doi.org\/10.1145\/3015166.3015207.","DOI":"10.1145\/3015166.3015207"},{"issue":"1","key":"172_CR37","doi-asserted-by":"publisher","first-page":"9","DOI":"10.1186\/s13636-019-0152-1","volume":"2019","author":"D. de Benito-Gorron","year":"2019","unstructured":"D. de Benito-Gorron, A. Lozano-Diez, D. T. Toledano, J. Gonzalez-Rodriguez, Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset. EURASIP J. Audio Speech Music. Process.2019(1), 9 (2019).","journal-title":"EURASIP J. Audio Speech Music. Process."},{"key":"172_CR38","doi-asserted-by":"publisher","unstructured":"I. Vi\u00f1als, P. Gimeno, A. Ortega, A. Miguel, E. Lleida, in Proc. Interspeech. Estimation of the number of speakers with variational bayesian PLDA in the DIHARD diarization challenge, (2018), pp. 2803\u20132807. 
https:\/\/doi.org\/10.21437\/interspeech.2018-1841.","DOI":"10.21437\/interspeech.2018-1841"},{"key":"172_CR39","doi-asserted-by":"publisher","unstructured":"I. Vi\u00f1als, P. Gimeno, A. Ortega, A. Miguel, E. Lleida, In-domain adaptation solutions for the RTVE 2018 diarization challenge. https:\/\/doi.org\/10.21437\/iberspeech.2018-45.","DOI":"10.21437\/iberspeech.2018-45"},{"key":"172_CR40","unstructured":"N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, M. Liberman, First dihard challenge evaluation plan, 2018. tech. Rep. (2018)."},{"issue":"24","key":"172_CR41","doi-asserted-by":"publisher","first-page":"5412","DOI":"10.3390\/app9245412","volume":"9","author":"E. Lleida","year":"2019","unstructured":"E. Lleida, A. Ortega, A. Miguel, V. Baz\u00e1n-Gil, C. P\u00e9rez, M. G\u00f3mez, A. de Prada, Albayzin 2018 evaluation: the iberspeech-RTVE challenge on speech technologies for spanish broadcast media. Appl. Sci.9(24), 5412 (2019).","journal-title":"Appl. Sci."},{"key":"172_CR42","doi-asserted-by":"publisher","first-page":"93","DOI":"10.1007\/978-3-642-11674-2_5","volume-title":"Advances in Music Information Retrieval","author":"J. Stephen Downie","year":"2010","unstructured":"J. S. Downie, A. F. Ehmann, M. Bay, M. C. Jones, The music information retrieval evaluation exchange: Some observations and insights. Advances in music information retrieval, 93\u2013115 (2010). https:\/\/doi.org\/10.1007\/978-3-642-11674-2_5."},{"key":"172_CR43","unstructured":"B. Mel\u00e9ndez-Catal\u00e1n, E. Molina, E. Gomez, in Music Information Retrieval Evaluation eX-change (MIREX). Music and\/or speech detection MIREX 2018 submission, (2018)."},{"key":"172_CR44","unstructured":"M. Choi, J. Lee, J. Nam, in Music Information Retrieval Evaluation eX-change (MIREX). Hybrid features for music and speech detection, (2018)."},{"key":"172_CR45","volume-title":"Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019)","author":"M. 
Mandel","year":"2019","unstructured":"M. Mandel, J. Salamon, D. P. W. Ellis, Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019) (New York University, New York, 2019)."},{"key":"172_CR46","doi-asserted-by":"publisher","unstructured":"S. Kapka, M. Lewandowski, in Proc. DCASE2019 Challenge. Sound source detection, localization and classification using consecutive ensemble of CRNN models, (2019). https:\/\/doi.org\/10.33682\/9f2t-ab23.","DOI":"10.33682\/9f2t-ab23"},{"key":"172_CR47","doi-asserted-by":"publisher","unstructured":"L. Lin, X. Wang, in Proc. DCASE2019 Challenge. Guided learning convolution system for DCASE 2019 task 4, (2019). https:\/\/doi.org\/10.33682\/53ed-z889.","DOI":"10.33682\/53ed-z889"},{"key":"172_CR48","doi-asserted-by":"crossref","unstructured":"S. Galliano, E. Geoffrois, D. Mostefa, K. Choukri, J. -F. Bonastre, G. Gravier, in 9th European Conference on Speech Communication and Technology. The ESTER phase II evaluation campaign for the rich transcription of french broadcast news, (2005).","DOI":"10.21437\/Interspeech.2005-441"},{"key":"172_CR49","doi-asserted-by":"crossref","unstructured":"J. \u017eibert, F. Miheli\u010d, J. P. Martens, H. Meinedo, J. P. D. S. Neto, L. Docio, C. G. Garcia-Mateo, P. David, J. \u017e\u010f\u00e1nsky\u0300, M. Pleva, et al., in 9th European Conference on Speech Communication and Technology. The COST278 broadcast news segmentation and speaker clustering evaluation-overview, methodology, systems, results, (2005).","DOI":"10.21437\/Interspeech.2005-68"},{"issue":"1","key":"172_CR50","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/1687-4722-2011-1","volume":"2011","author":"T. Butko","year":"2011","unstructured":"T. Butko, C. Nadeu, Audio segmentation of broadcast news in the Albayzin-2010 evaluation: overview, results, and discussion. EURASIP Journal on Audio, Speech, and Music Processing. 
2011(1), 1 (2011).","journal-title":"EURASIP Journal on Audio, Speech, and Music Processing"},{"key":"172_CR51","unstructured":"A. Ortega, D. Castan, A. Miguel, E. Lleida, in Proc. Iberspeech 2014: VIII Jornadas en Tecnolog\u00eda del Habla and IV Iberian SLTech Workshop. The Albayz\u00edn 2012 audio segmentation evaluation, pp. 283\u2013289."},{"key":"172_CR52","doi-asserted-by":"publisher","unstructured":"P. Gimeno, I. Vi\u00f1als, A. Ortega, A. Miguel, E. Lleida, in Proc. Iberspeech 2018. A recurrent neural network approach to audio segmentation for broadcast domain data, pp. 87\u201391. https:\/\/doi.org\/10.21437\/iberspeech.2018-19.","DOI":"10.21437\/iberspeech.2018-19"},{"key":"172_CR53","unstructured":"T. Ko, V. Peddinti, D. Povey, S. Khudanpur, in Proc. Interspeech. Audio augmentation for speech recognition, (2015), pp. 3586\u20133589."},{"key":"172_CR54","unstructured":"J. Schl\u00fcter, T. Grill, in 6th International Society for Music Information Retrieval (ISMIR) Conference. Exploring data augmentation for improved singing voice detection with neural networks, (2015), pp. 121\u2013126."},{"key":"172_CR55","unstructured":"H. Zhang, M. Cisse, Y. N. Dauphin, D. Lopez-Paz, Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017)."},{"key":"172_CR56","doi-asserted-by":"publisher","unstructured":"M. A. Bartsch, G. H. Wakefield, in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. To catch a chorus: using chroma-based representations for audio thumbnailing, (2001), pp. 15\u201318. https:\/\/doi.org\/10.1109\/aspaa.2001.969531.","DOI":"10.1109\/aspaa.2001.969531"},{"key":"172_CR57","unstructured":"N. Jiang, P. Grosche, V. Konz, M. M\u00fcller, in 42nd International Conference: Semantic Audio. Analyzing chroma feature types for automated chord recognition, (2011)."},{"key":"172_CR58","doi-asserted-by":"publisher","unstructured":"H. Papadopoulos, G. 
Peeters, in International Workshop on Content-Based Multimedia Indexing (CBMI). Large-scale study of chord estimation algorithms based on chroma representation and HMM, (2007), pp. 53\u201360. https:\/\/doi.org\/10.1109\/cbmi.2007.385392.","DOI":"10.1109\/cbmi.2007.385392"},{"key":"172_CR59","unstructured":"F. Eyben, F. Weninger, F. Gross, B. Schuller, in Proc. 21st ACM International Conference on Multimedia. Recent developments in openSMILE, the Munich open-source multimedia feature extractor, (2013), pp. 835\u2013838."},{"key":"172_CR60","unstructured":"S. Ruder, An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747 (2016)."},{"key":"172_CR61","unstructured":"A. Paszke, G. Chanan, Z. Lin, S. Gross, E. Yang, L. Antiga, Z. Devito, Automatic differentiation in PyTorch, vol. 30, (2017)."},{"issue":"4","key":"172_CR62","doi-asserted-by":"publisher","first-page":"988","DOI":"10.1109\/78.492552","volume":"44","author":"F. Gustafsson","year":"1996","unstructured":"F. Gustafsson, Determining the initial states in forward-backward filtering. IEEE Trans. Sig. Process.44(4), 988\u2013992 (1996).","journal-title":"IEEE Trans. Sig. Process."},{"key":"172_CR63","unstructured":"NIST, The 2009 (RT-09) Rich Transcription Meeting Recognition Evaluation Plan, (Melbourne, 2009)."},{"key":"172_CR64","doi-asserted-by":"publisher","first-page":"14","DOI":"10.1007\/978-3-030-00764-5_2","volume-title":"Advances in Multimedia Information Processing \u2013 PCM 2018","author":"Kele Xu","year":"2018","unstructured":"K. Xu, D. Feng, H. Mi, B. Zhu, D. Wang, L. Zhang, H. Cai, S. Liu, in 19th Pacific Rim Conference on Multimedia. Mixup-based acoustic scene classification using multi-channel convolutional neural network, (2018), pp. 14\u201323. https:\/\/doi.org\/10.1007\/978-3-030-00764-5_2."},{"key":"172_CR65","unstructured":"A. Gallardo Antol\u00edn, R. 
San Segundo Hern\u00e1ndez, UPM-UC3M system for music and speech segmentation (2010)."},{"key":"172_CR66","unstructured":"R. Serizel, N. Turpault, H. Eghbal-Zadeh, A. P. Shah, in Proc. Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018). Large-scale weakly labeled semi-supervised sound event detection in domestic environments, (2018), pp. 19\u201323. https:\/\/hal.inria.fr\/hal-01850270."},{"key":"172_CR67","unstructured":"D. Tavarez, E. Navas, D. Erro, I. Saratxaga, in Proc. Iberspeech 2012. Audio segmentation system by Aholab for Albayzin 2012 evaluation campaign, pp. 577\u2013584."},{"key":"172_CR68","unstructured":"S. Cerd\u00e0, J. Albert, A. Gim\u00e9nez Pastor, J. Andr\u00e9s Ferrer, J. Civera Saiz, A. Juan C\u00edscar, Albayzin evaluation: the PRHLT-UPV audio segmentation system."}],"container-title":["EURASIP Journal on Audio, Speech, and Music Processing"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/s13636-020-00172-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/article\/10.1186\/s13636-020-00172-6\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/s13636-020-00172-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,10,17]],"date-time":"2022-10-17T17:23:38Z","timestamp":1666027418000},"score":1,"resource":{"primary":{"URL":"https:\/\/asmp-eurasipjournals.springeropen.com\/articles\/10.1186\/s13636-020-00172-6"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,3,5]]},"references-count":68,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2020,12]]}},"alternative-id":["172"],"URL":"https:\/\/doi.org\/10.1186\/s13636-020-00172-6","relation":{},"ISSN"
:["1687-4722"],"issn-type":[{"value":"1687-4722","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,3,5]]},"assertion":[{"value":"1 August 2019","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 February 2020","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"5 March 2020","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The authors declare that they have no competing interests.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"5"}}