{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,24]],"date-time":"2026-04-24T16:42:56Z","timestamp":1777048976353,"version":"3.51.4"},"reference-count":64,"publisher":"Springer Science and Business Media LLC","issue":"3","license":[{"start":{"date-parts":[[2024,5,8]],"date-time":"2024-05-08T00:00:00Z","timestamp":1715126400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,5,8]],"date-time":"2024-05-08T00:00:00Z","timestamp":1715126400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Neural Process Lett"],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>To address the challenges of the poor representation capability and low data utilization rate of end-to-end speech recognition models in deep learning, this study proposes an end-to-end speech recognition model based on multi-scale feature fusion and multi-view self-supervised learning (MM-ASR). It adopts a multi-task learning paradigm for training. The proposed method emphasizes the importance of inter-layer information within shared encoders, aiming to enhance the model\u2019s characterization capability via the multi-scale feature fusion module. Moreover, we apply multi-view self-supervised learning to effectively exploit data information. Our approach is rigorously evaluated on the Aishell-1 dataset and further validated its effectiveness on the English corpus WSJ. 
The experimental results demonstrate a noteworthy 4.6<jats:inline-formula><jats:alternatives><jats:tex-math>$$\\%$$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                  <mml:mo>%<\/mml:mo>\n                <\/mml:math><\/jats:alternatives><\/jats:inline-formula> reduction in character error rate, indicating significantly improved speech recognition performance. These findings showcase the effectiveness and potential of our proposed MM-ASR model for end-to-end speech recognition tasks.<\/jats:p>","DOI":"10.1007\/s11063-024-11614-z","type":"journal-article","created":{"date-parts":[[2024,5,8]],"date-time":"2024-05-08T12:01:46Z","timestamp":1715169706000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":13,"title":["Multi-view Self-supervised Learning and Multi-scale Feature Fusion for Automatic Speech Recognition"],"prefix":"10.1007","volume":"56","author":[{"given":"Jingyu","family":"Zhao","sequence":"first","affiliation":[]},{"given":"Ruwei","family":"Li","sequence":"additional","affiliation":[]},{"given":"Maocun","family":"Tian","sequence":"additional","affiliation":[]},{"given":"Weidong","family":"An","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,5,8]]},"reference":[{"issue":"4","key":"11614_CR1","doi-asserted-by":"publisher","first-page":"50","DOI":"10.1109\/MSP.2011.941065","volume":"28","author":"ML Seltzer","year":"2011","unstructured":"Seltzer ML, Ju Y-C, Tashev I, Wang Y-Y, Yu D (2011) In-car media search. IEEE Signal Process Mag 28(4):50\u201360. https:\/\/doi.org\/10.1109\/MSP.2011.941065","journal-title":"IEEE Signal Process Mag"},{"key":"11614_CR2","doi-asserted-by":"publisher","unstructured":"Graves A, Mohamed A-R, Hinton G (2013) Speech recognition with deep recurrent neural networks. 
In: 2013 IEEE international conference on acoustics, speech and signal processing, vancouver, BC, Canada, pp 6645-6649, https:\/\/doi.org\/10.1109\/ICASSP.2013.6638947","DOI":"10.1109\/ICASSP.2013.6638947"},{"issue":"6","key":"11614_CR3","doi-asserted-by":"publisher","first-page":"82","DOI":"10.1109\/MSP.2012.2205597","volume":"29","author":"G Hinton","year":"2012","unstructured":"Hinton G et al (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process Mag 29(6):82\u201397. https:\/\/doi.org\/10.1109\/MSP.2012.2205597","journal-title":"IEEE Signal Process Mag"},{"issue":"8","key":"11614_CR4","doi-asserted-by":"publisher","first-page":"1018","DOI":"10.3390\/sym11081018","volume":"11","author":"D Wang","year":"2019","unstructured":"Wang D, Wang X, Lv S (2019) An overview of end-to-end automatic speech recognition. Symmetry 11(8):1018","journal-title":"Symmetry"},{"key":"11614_CR5","doi-asserted-by":"crossref","unstructured":"Li J (2022) Recent advances in end-to-end automatic speech recognition. APSIPA Trans Signal Inf Process 11(1)","DOI":"10.1561\/116.00000050"},{"key":"11614_CR6","doi-asserted-by":"crossref","unstructured":"Graves A, Fern\u00e1ndez S, Gomez F, et al (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd international conference on Machine learning. pp 369-376","DOI":"10.1145\/1143844.1143891"},{"key":"11614_CR7","doi-asserted-by":"publisher","unstructured":"Deng K, et al (2022) Improving CTC-Based Speech Recognition Via Knowledge Transferring from Pre-Trained Language Models. 
In: ICASSP 2022\u20132022 IEEE international conference on acoustics, speech and signal processing (ICASSP), Singapore, Singapore, pp 8517-8521, https:\/\/doi.org\/10.1109\/ICASSP43922.2022.9747887","DOI":"10.1109\/ICASSP43922.2022.9747887"},{"key":"11614_CR8","doi-asserted-by":"crossref","unstructured":"Nakagome Y, Komatsu T, Fujita Y, et al (2022) InterAug: augmenting noisy intermediate predictions for CTC-based ASR. arXiv preprint arXiv:2204.00174","DOI":"10.21437\/Interspeech.2022-11284"},{"key":"11614_CR9","doi-asserted-by":"crossref","unstructured":"Graves A (2012) Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711","DOI":"10.1007\/978-3-642-24797-2"},{"key":"11614_CR10","doi-asserted-by":"publisher","unstructured":"Kim S, Hori T, Watanabe S (2017) Joint CTC-attention based end-to-end speech recognition using multi-task learning. In: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), New Orleans, LA, USA, pp 4835-4839, https:\/\/doi.org\/10.1109\/ICASSP.2017.7953075","DOI":"10.1109\/ICASSP.2017.7953075"},{"key":"11614_CR11","doi-asserted-by":"publisher","unstructured":"Rao K, Sak H, Prabhavalkar R (2017) Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer. In: (2017) IEEE automatic speech recognition and understanding workshop (ASRU). Okinawa, Japan, pp 193\u2013199. https:\/\/doi.org\/10.1109\/ASRU.2017.8268935","DOI":"10.1109\/ASRU.2017.8268935"},{"key":"11614_CR12","first-page":"1408","volume":"2019","author":"S Karita","year":"2019","unstructured":"Karita S, Soplin NEY, Watanabe S et al (2019) Improving transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration[C]\/\/Proceedings of the Annual Conference of the International Speech Communication Association. INTERSPEECH. 
2019:1408\u20131412","journal-title":"INTERSPEECH."},{"key":"11614_CR13","doi-asserted-by":"publisher","unstructured":"Kim S, Hori T, Watanabe S (2017) Joint CTC-attention based end-to-end speech recognition using multi-task learning. In: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), New Orleans, LA, USA, pp 4835-4839, https:\/\/doi.org\/10.1109\/ICASSP.2017.7953075","DOI":"10.1109\/ICASSP.2017.7953075"},{"key":"11614_CR14","unstructured":"Zhang B, Wu D, Yao Z, et al (2020) Unified streaming and non-streaming two-pass end-to-end model for speech recognition. arXiv preprint arXiv:2012.05481"},{"key":"11614_CR15","unstructured":"Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Adv Neural Inf Process Syst 30"},{"key":"11614_CR16","doi-asserted-by":"crossref","unstructured":"Gulati A, Qin J, Chiu CC, et al (2020) Conformer: convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100","DOI":"10.21437\/Interspeech.2020-3015"},{"key":"11614_CR17","doi-asserted-by":"crossref","unstructured":"Yao Z, Wu D, Wang X, et al (2021) Wenet: production oriented streaming and non-streaming end-to-end speech recognition toolkit. arXiv preprint arXiv:2102.01547","DOI":"10.21437\/Interspeech.2021-1983"},{"key":"11614_CR18","doi-asserted-by":"crossref","unstructured":"Zhang B, Wu D, Peng Z, et al (2022) Wenet 2.0: more productive end-to-end speech recognition toolkit. arXiv preprint arXiv:2203.15455","DOI":"10.21437\/Interspeech.2022-483"},{"key":"11614_CR19","doi-asserted-by":"publisher","unstructured":"Bu H, Du J, Na X, Wu B, Zheng H (2017) AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline. In: 20th conference of the oriental chapter of the international coordinating committee on speech databases and speech I\/O systems and assessment (O-COCOSDA), Seoul. Korea (South) 2017, pp 1\u20135. 
https:\/\/doi.org\/10.1109\/ICSDA.2017.8384449","DOI":"10.1109\/ICSDA.2017.8384449"},{"issue":"9","key":"11614_CR20","doi-asserted-by":"publisher","first-page":"1481","DOI":"10.1109\/TASLP.2019.2922832","volume":"27","author":"Y-C Chen","year":"2019","unstructured":"Chen Y-C, Huang S-F, Lee H-y, Wang Y-H, Shen C-H (2019) Audio word2vec: sequence-to-sequence autoencoding for unsupervised learning of audio segmentation and representation. IEEE\/ACM Trans Audio Speech Lang Process (TASLP) 27(9):1481\u20131493","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process (TASLP)"},{"key":"11614_CR21","doi-asserted-by":"crossref","unstructured":"Hsu W-N, Zhang Y, Glass J (2017) Learning latent representations for speech generation and transformation. In: Interspeech, pp 1273\u20131277","DOI":"10.21437\/Interspeech.2017-349"},{"key":"11614_CR22","unstructured":"Hsu W N, Zhang Y, Glass J (2017) Unsupervised learning of disentangled and interpretable representations from sequential data. Adv Neural Inf Process Syst 30"},{"issue":"12","key":"11614_CR23","doi-asserted-by":"publisher","first-page":"2041","DOI":"10.1109\/TASLP.2019.2938863","volume":"27","author":"J Chorowski","year":"2019","unstructured":"Chorowski J, Weiss RJ, Bengio S et al (2019) Unsupervised speech representation learning using wavenet autoencoders. IEEE\/ACM Trans Audio Speech Lang Process 27(12):2041\u20132053","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"11614_CR24","doi-asserted-by":"crossref","unstructured":"Chung Y A, Tang H, Glass J (2020) Vector-quantized autoregressive predictive coding. arXiv preprint arXiv:2005.08392","DOI":"10.21437\/Interspeech.2020-1228"},{"key":"11614_CR25","doi-asserted-by":"crossref","unstructured":"Chung Y A, Hsu W N, Tang H, et al (2019) An unsupervised autoregressive model for speech representation learning. 
arXiv preprint arXiv:1904.03240","DOI":"10.21437\/Interspeech.2019-1473"},{"key":"11614_CR26","doi-asserted-by":"crossref","unstructured":"Chung Y A, Glass J (2020) Generative pre-training for speech with autoregressive predictive coding. In: ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE pp 3497-3501","DOI":"10.1109\/ICASSP40776.2020.9054438"},{"key":"11614_CR27","doi-asserted-by":"crossref","unstructured":"Chung Y A, Glass J (2020) Improved speech representations with multi-target autoregressive predictive coding. arXiv preprint arXiv:2004.05274","DOI":"10.18653\/v1\/2020.acl-main.213"},{"key":"11614_CR28","doi-asserted-by":"crossref","unstructured":"Liu A H, Chung Y A, Glass J (2020) Non-autoregressive predictive coding for learning speech representations from local dependencies. arXiv preprint arXiv:2011.00406","DOI":"10.21437\/Interspeech.2021-349"},{"key":"11614_CR29","doi-asserted-by":"publisher","first-page":"2351","DOI":"10.1109\/TASLP.2021.3095662","volume":"29","author":"AT Liu","year":"2021","unstructured":"Liu AT, Li SW, Lee H (2021) Tera: Self-supervised learning of transformer encoder representation for speech. IEEE\/ACM Trans Audio Speech Lang Process 29:2351\u20132366","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"11614_CR30","doi-asserted-by":"crossref","unstructured":"Liu A T, Yang S, Chi P H, et al (2020) Mockingjay: unsupervised speech representation learning with deep bidirectional transformer encoders. In: ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 6419\u20136423","DOI":"10.1109\/ICASSP40776.2020.9054458"},{"key":"11614_CR31","doi-asserted-by":"crossref","unstructured":"Ling S, Liu Y, Salazar J, et al (2020) Deep contextualized acoustic representations for semi-supervised speech recognition. In: ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). 
IEEE, pp 6429\u20136433","DOI":"10.1109\/ICASSP40776.2020.9053176"},{"key":"11614_CR32","unstructured":"Ling S, Liu Y (2020) Decoar 2.0: deep contextualized acoustic representations with vector quantization. arXiv preprint arXiv:2012.06659"},{"key":"11614_CR33","unstructured":"Oord A, Li Y, Vinyals O (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748"},{"key":"11614_CR34","doi-asserted-by":"crossref","unstructured":"Schneider S, Baevski A, Collobert R, et al (2019) wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862","DOI":"10.21437\/Interspeech.2019-1873"},{"key":"11614_CR35","unstructured":"Baevski A, Schneider S, Auli M (2019) vq-wav2vec: self-supervised learning of discrete speech representations. arXiv preprint arXiv:1910.05453"},{"key":"11614_CR36","first-page":"12449","volume":"33","author":"A Baevski","year":"2020","unstructured":"Baevski A, Zhou Y, Mohamed A et al (2020) wav2vec 2.0: a framework for self-supervised learning of speech representations. Adv Neural Inf Process Syst 33:12449\u201312460","journal-title":"Adv Neural Inf Process Syst"},{"key":"11614_CR37","doi-asserted-by":"crossref","unstructured":"Baevski A, Mohamed A (2020) Effectiveness of self-supervised pre-training for ASR. In: ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 7694\u20137698","DOI":"10.1109\/ICASSP40776.2020.9054224"},{"key":"11614_CR38","doi-asserted-by":"publisher","first-page":"3451","DOI":"10.1109\/TASLP.2021.3122291","volume":"29","author":"WN Hsu","year":"2021","unstructured":"Hsu WN, Bolte B, Tsai YHH et al (2021) Hubert: Self-supervised speech representation learning by masked prediction of hidden units. 
IEEE\/ACM Trans Audio Speech Lang Process 29:3451\u20133460","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"11614_CR39","unstructured":"Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 25"},{"key":"11614_CR40","doi-asserted-by":"crossref","unstructured":"Toshev A, Szegedy C (2014) Deeppose: human pose estimation via deep neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1653\u20131660","DOI":"10.1109\/CVPR.2014.214"},{"key":"11614_CR41","doi-asserted-by":"crossref","unstructured":"Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3431\u20133440","DOI":"10.1109\/CVPR.2015.7298965"},{"key":"11614_CR42","first-page":"91","volume":"28","author":"S Ren","year":"2015","unstructured":"Ren S, He K, Girshick R et al (2015) Faster r-cnn: towards real-time object detection with region proposal networks. Adv Neural Inf Process Syst 28:91\u201399","journal-title":"Adv Neural Inf Process Syst"},{"key":"11614_CR43","doi-asserted-by":"crossref","unstructured":"Desplanques B, Thienpondt J, Demuynck K (2020) Ecapatdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification. arXiv preprint arXiv:2005.07143","DOI":"10.21437\/Interspeech.2020-2650"},{"key":"11614_CR44","doi-asserted-by":"crossref","unstructured":"Thienpondt J, Desplanques B, Demuynck K (2021) Integrating frequency translational invariance in tdnns and frequency positional information in 2d resnets to enhance speaker verification. 
arXiv preprint arXiv:2104.02370","DOI":"10.21437\/Interspeech.2021-1570"},{"key":"11614_CR45","doi-asserted-by":"crossref","unstructured":"Liu T, Das R K, Lee K A, et al (2022) MFA: TDNN with multi-scale frequency-channel attention for text-independent speaker verification with short utterances. In: ICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 7517\u20137521","DOI":"10.1109\/ICASSP43922.2022.9747021"},{"key":"11614_CR46","doi-asserted-by":"crossref","unstructured":"Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp7132\u20137141","DOI":"10.1109\/CVPR.2018.00745"},{"issue":"2","key":"11614_CR47","doi-asserted-by":"publisher","first-page":"652","DOI":"10.1109\/TPAMI.2019.2938758","volume":"43","author":"SH Gao","year":"2019","unstructured":"Gao SH, Cheng MM, Zhao K et al (2019) Res2net: a new multi-scale backbone architecture. IEEE Trans Pattern Anal Mach Intell 43(2):652\u2013662","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"11614_CR48","doi-asserted-by":"crossref","unstructured":"Zhang Y, Lv Z, Wu H, et al (2022) Mfa-conformer: multi-scale feature aggregation conformer for automatic speaker verification. arXiv preprint arXiv:2203.15249","DOI":"10.21437\/Interspeech.2022-563"},{"key":"11614_CR49","doi-asserted-by":"crossref","unstructured":"He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770\u2013778","DOI":"10.1109\/CVPR.2016.90"},{"key":"11614_CR50","doi-asserted-by":"crossref","unstructured":"Dai Z, Yang Z, Yang Y, et al (2019) Transformer-xl: attentive language models beyond a fixed-length context. 
arXiv preprint arXiv:1901.02860","DOI":"10.18653\/v1\/P19-1285"},{"key":"11614_CR51","unstructured":"Devlin J, Chang M W, Lee K, et al (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805"},{"key":"11614_CR52","unstructured":"Oord A, Li Y, Vinyals O (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748"},{"key":"11614_CR53","doi-asserted-by":"publisher","unstructured":"Chen Z, Zhang Y, Rosenberg A, Ramabhadran B, Wang G, Moreno P (2021) Injecting text in self-supervised speech pretraining. In: IEEE automatic speech recognition and understanding workshop (ASRU). Cartagena, Colombia pp 251\u2013258. https:\/\/doi.org\/10.1109\/ASRU51503.2021.9688018","DOI":"10.1109\/ASRU51503.2021.9688018"},{"issue":"1","key":"11614_CR54","first-page":"1929","volume":"15","author":"N Srivastava","year":"2014","unstructured":"Srivastava N, Hinton G, Krizhevsky A et al (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929\u20131958","journal-title":"J Mach Learn Res"},{"key":"11614_CR55","unstructured":"Hermans JR, Spanakis G, M\u00f6ckel R (2017) Accumulated gradient normalization. In: Asian conference on machine learning. PMLR, pp 439\u2013454"},{"key":"11614_CR56","doi-asserted-by":"publisher","unstructured":"Karita S, et al (2019) A comparative study on transformer vs rnn in speech applications. In: IEEE automatic speech recognition and understanding workshop (ASRU), Singapore, pp 449\u2013456, https:\/\/doi.org\/10.1109\/ASRU46091.2019.9003750","DOI":"10.1109\/ASRU46091.2019.9003750"},{"key":"11614_CR57","doi-asserted-by":"crossref","unstructured":"An K, Xiang H, Ou Z (2020) CAT: a CTC-CRF based ASR toolkit bridging the hybrid and the end-to-end approaches towards data efficiency and low latency. 
arXiv preprint arXiv:2005.13326","DOI":"10.21437\/Interspeech.2020-2732"},{"key":"11614_CR58","doi-asserted-by":"crossref","unstructured":"Watanabe S, Hori T, Karita S, et al (2018) Espnet: end-to-end speech processing toolkit. arXiv preprint arXiv:1804.00015","DOI":"10.21437\/Interspeech.2018-1456"},{"key":"11614_CR59","doi-asserted-by":"crossref","unstructured":"An K, Shi X, Zhang S (2023) BAT: boundary aware transducer for memory-efficient and low-latency ASR. arXiv preprint arXiv:2305.11571","DOI":"10.21437\/Interspeech.2023-770"},{"key":"11614_CR60","doi-asserted-by":"crossref","unstructured":"Gao Z, Li Z, Wang J, et al (2023) FunASR: a fundamental end-to-end speech recognition toolkit. arXiv preprint arXiv:2305.11013","DOI":"10.21437\/Interspeech.2023-1428"},{"key":"11614_CR61","doi-asserted-by":"crossref","unstructured":"Fang Y, Li X (2023) Unimodal aggregation for CTC-based speech recognition. arXiv preprint arXiv:2309.08150","DOI":"10.1109\/ICASSP48485.2024.10448248"},{"issue":"11","key":"11614_CR62","doi-asserted-by":"publisher","first-page":"1949","DOI":"10.1109\/TASLP.2018.2848701","volume":"26","author":"H Hadian","year":"2018","unstructured":"Hadian H, Sameti H, Povey D, Khudanpur S (2018) Flatstart single-stage discriminatively trained HMM-based models for ASR. IEEE\/ACM Trans Audio Speech Lang Process 26(11):1949\u20131961","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"11614_CR63","doi-asserted-by":"crossref","unstructured":"Zheng H, An K, Ou Z (2021) Efficient neural architecture search for end-to-end speech recognition via straight-through gradients. In: 2021 IEEE spoken language technology workshop (SLT). IEEE, pp 60\u201367","DOI":"10.1109\/SLT48900.2021.9383527"},{"key":"11614_CR64","unstructured":"Zeghidour N, Xu Q, Liptchinsky V, Usunier N, Synnaeve G, Collobert R (2018) Fully convolutional speech recognition. 
arXiv preprint arXiv:1812.06864"}],"container-title":["Neural Processing Letters"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11063-024-11614-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11063-024-11614-z\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11063-024-11614-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,7,15]],"date-time":"2024-07-15T11:18:01Z","timestamp":1721042281000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11063-024-11614-z"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,5,8]]},"references-count":64,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2024,6]]}},"alternative-id":["11614"],"URL":"https:\/\/doi.org\/10.1007\/s11063-024-11614-z","relation":{},"ISSN":["1573-773X"],"issn-type":[{"value":"1573-773X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,5,8]]},"assertion":[{"value":"6 April 2024","order":1,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"8 May 2024","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"We the authors of this manuscript entitled \u201cMulti-view self-supervised learning and multi-scale feature fusion for automatic speech recognition\u201d declare that we have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this 
paper.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}},{"value":"The submission of this article has been approved by all authors, and the data used in this article has been agreed by the relevant authorities and does not raise issues such as privacy and information security.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Informed consent"}}],"article-number":"168"}}