{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T01:04:48Z","timestamp":1760144688503,"version":"build-2065373602"},"reference-count":23,"publisher":"MDPI AG","issue":"10","license":[{"start":{"date-parts":[[2024,5,16]],"date-time":"2024-05-16T00:00:00Z","timestamp":1715817600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Natural Science Foundation of China","award":["92267301","62071058","CX2023305"],"award-info":[{"award-number":["92267301","62071058","CX2023305"]}]},{"name":"BUPT Excellent Ph.D. Students Foundation","award":["92267301","62071058","CX2023305"],"award-info":[{"award-number":["92267301","62071058","CX2023305"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>We consider the problem of learned speech transmission. Existing methods have exploited joint source\u2013channel coding (JSCC) to encode speech directly to transmitted symbols to improve the robustness over noisy channels. However, the fundamental limit of these methods is the failure of identification of content diversity across speech frames, leading to inefficient transmission. In this paper, we propose a novel neural speech transmission framework named NST. It can be optimized for superior rate\u2013distortion\u2013perception (RDP) performance toward the goal of high-fidelity semantic communication. Particularly, a learned entropy model assesses latent speech features to quantify the semantic content complexity, which facilitates the adaptive transmission rate allocation. NST enables a seamless integration of the source content with channel state information through variable-length joint source\u2013channel coding, which maximizes the coding gain. 
Furthermore, we present a streaming variant of NST, which adopts causal coding based on sliding windows. Experimental results verify that NST outperforms existing speech transmission methods including separation-based and JSCC solutions in terms of RDP performance. Streaming NST achieves low-latency transmission with a slight quality degradation, which is tailored for real-time speech communication.<\/jats:p>","DOI":"10.3390\/s24103169","type":"journal-article","created":{"date-parts":[[2024,5,16]],"date-time":"2024-05-16T09:30:03Z","timestamp":1715851803000},"page":"3169","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Rate\u2013Distortion\u2013Perception Optimized Neural Speech Transmission System for High-Fidelity Semantic Communications"],"prefix":"10.3390","volume":"24","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5463-8614","authenticated-orcid":false,"given":"Shengshi","family":"Yao","sequence":"first","affiliation":[{"name":"Key Laboratory of Universal Wireless Communications, Beijing University of Posts and Telecommunications, Beijing 100876, China"}]},{"given":"Zixuan","family":"Xiao","sequence":"additional","affiliation":[{"name":"Key Laboratory of Universal Wireless Communications, Beijing University of Posts and Telecommunications, Beijing 100876, China"}]},{"given":"Kai","family":"Niu","sequence":"additional","affiliation":[{"name":"Key Laboratory of Universal Wireless Communications, Beijing University of Posts and Telecommunications, Beijing 100876, China"},{"name":"Department of Broadband Communication, Peng Cheng Laboratory, Shenzhen 518066, China"}]}],"member":"1968","published-online":{"date-parts":[[2024,5,16]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"2434","DOI":"10.1109\/JSAC.2021.3087240","article-title":"Semantic communication systems for speech transmission","volume":"39","author":"Weng","year":"2021","journal-title":"IEEE J. 
Sel. Areas Commun."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"245","DOI":"10.1109\/JSAC.2022.3221952","article-title":"Semantic-preserved communication system for highly efficient speech transmission","volume":"41","author":"Han","year":"2022","journal-title":"IEEE J. Sel. Areas Commun."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Guo, J., Zhang, Y., Liu, C., Xu, W., and Bie, Z. (2023, September 5\u20138). SNR-Adaptive Multi-Layer Semantic Communication for Speech. Proceedings of the 2023 IEEE 34th Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), Toronto, ON, Canada.","DOI":"10.1109\/PIMRC56721.2023.10293992"},{"key":"ref_4","unstructured":"Qin, Z., Tao, X., Lu, J., Tong, W., and Li, G.Y. (2021). Semantic communications: Principles and challenges. arXiv."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"170","DOI":"10.1109\/MWC.017.2100705","article-title":"Communication beyond transmitting bits: Semantics-guided source and channel coding","volume":"30","author":"Dai","year":"2023","journal-title":"IEEE Wirel. Commun."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"42","DOI":"10.1109\/MCOM.004.2200819","article-title":"Deep joint source-channel coding for semantic communications","volume":"61","author":"Xu","year":"2023","journal-title":"IEEE Commun. Mag."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"41","DOI":"10.1109\/COMST.2023.3333342","article-title":"Semantics-empowered communications: A tutorial-cum-survey","volume":"26","author":"Lu","year":"2023","journal-title":"IEEE Commun. Surv. Tutor."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"567","DOI":"10.1109\/TCCN.2019.2919300","article-title":"Deep joint source-channel coding for wireless image transmission","volume":"5","author":"Bourtsoulatze","year":"2019","journal-title":"IEEE Trans. Cogn. Commun. 
Netw."},{"key":"ref_9","unstructured":"Ball\u00e9, J., Laparra, V., and Simoncelli, E.P. (2017, April 24\u201326). End-to-end optimized image compression. Proceedings of the 5th International Conference on Learning Representations, Toulon, France."},{"key":"ref_10","unstructured":"Ball\u00e9, J., Minnen, D., Singh, S., Hwang, S.J., and Johnston, N. (2018, April 30\u2013May 3). Variational image compression with a scale hyperprior. Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Xiao, Z., Yao, S., Dai, J., Wang, S., Niu, K., and Zhang, P. (2023, June 4\u201310). Wireless deep speech semantic transmission. Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece.","DOI":"10.1109\/ICASSP49357.2023.10094680"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"162","DOI":"10.1109\/TCOM.1964.1088973","article-title":"Dither signals and their effect on quantization noise","volume":"12","author":"Schuchman","year":"1964","journal-title":"IEEE Trans. Commun. Technol."},{"key":"ref_13","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. (2017, December 4\u20139). Attention is all you need. Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017): 31st Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_14","unstructured":"Dai, Z., Yang, Z., Yang, Y., Carbonell, J.G., Le, Q., and Salakhutdinov, R. (2019, July 28\u2013August 2). Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy."},{"key":"ref_15","unstructured":"Garofolo, J.S. (1993). 
Timit Acoustic Phonetic Continuous Speech Corpus, Linguistic Data Consortium."},{"key":"ref_16","unstructured":"Muda, L., Begam, M., and Elamvazuthi, I. (2010). Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques. arXiv."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"620","DOI":"10.1109\/TSA.2002.804299","article-title":"The adaptive multirate wideband speech codec (AMR-WB)","volume":"10","author":"Bessette","year":"2002","journal-title":"IEEE Trans. Speech Audio Process."},{"key":"ref_18","unstructured":"Valin, J.M., Vos, K., and Terriberry, T. (2022, July 01). Definition of the Opus Audio Codec, Technical Report. Available online: https:\/\/www.rfc-editor.org\/rfc\/pdfrfc\/rfc6716.txt.pdf."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Ryan, W., and Lin, S. (2009). Channel Codes: Classical and Modern, Cambridge University Press.","DOI":"10.1017\/CBO9780511803253"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Peng, F., Zhang, J., and Ryan, W.E. (2007, March 11\u201315). Adaptive modulation and coding for IEEE 802.11 n. Proceedings of the 2007 IEEE Wireless Communications and Networking Conference, Hong Kong.","DOI":"10.1109\/WCNC.2007.126"},{"key":"ref_21","unstructured":"ITU-T (2001). Perceptual Evaluation of Speech Quality (PESQ): An Objective Method for End-to-End Speech Quality Assessment of Narrow-Band Telephone Networks and Speech Codecs, International Telecommunication Union."},{"key":"ref_22","unstructured":"BS Series (2014). Method for the Subjective Assessment of Intermediate Quality Level of Audio Systems, International Telecommunication Union."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"92","DOI":"10.1109\/MWC.2012.6393523","article-title":"The COST 2100 MIMO channel model","volume":"19","author":"Liu","year":"2012","journal-title":"IEEE Wirel. 
Commun."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/24\/10\/3169\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T14:43:32Z","timestamp":1760107412000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/24\/10\/3169"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,5,16]]},"references-count":23,"journal-issue":{"issue":"10","published-online":{"date-parts":[[2024,5]]}},"alternative-id":["s24103169"],"URL":"https:\/\/doi.org\/10.3390\/s24103169","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2024,5,16]]}}}