{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,2]],"date-time":"2026-01-02T12:44:59Z","timestamp":1767357899497,"version":"3.48.0"},"reference-count":49,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2026,1,1]],"date-time":"2026-01-01T00:00:00Z","timestamp":1767225600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["JSAN"],"abstract":"<jats:p>Seconds count differently for people in danger. We present a real-time streaming pipeline for audio-based detection of hazardous life events affecting life and property. The system operates online rather than as a retrospective analysis tool. Its objective is to reduce the latency between the occurrence of a crime, conflict, or accident and the corresponding response by authorities. The key idea is to map reality as perceived by audio into a written story and question the text via a large language model. The method integrates streaming, zero-shot algorithms in an online decoding mode that convert sound into short, interpretable tokens, which are processed by a lightweight language model. CLAP text\u2013audio prompting identifies agitation, panic, and distress cues, combined with conversational dynamics derived from speaker diarization. Lexical information is obtained through streaming automatic speech recognition, while general audio events are detected by a streaming version of Audio Spectrogram Transformer tagger. Prosodic features are incorporated using pitch- and energy-based rules derived from robust F0 tracking and periodicity measures. The system uses a large language model configured for online decoding and outputs binary (YES\/NO) life-threatening risk decisions every two seconds, along with a brief justification and a final session-level verdict. The system emphasizes interpretability and accountability. We evaluate it on a subset of the X-Violence dataset, comprising only real-world videos. We release code, prompts, decision policies, evaluation splits, and example logs to enable the community to replicate, critique, and extend our blueprint.<\/jats:p>","DOI":"10.3390\/jsan15010006","type":"journal-article","created":{"date-parts":[[2026,1,2]],"date-time":"2026-01-02T12:23:44Z","timestamp":1767356624000},"page":"6","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["From Sound to Risk: Streaming Audio Flags for Real-World Hazard Inference Based on AI"],"prefix":"10.3390","volume":"15","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7111-3331","authenticated-orcid":false,"given":"Ilyas","family":"Potamitis","sequence":"first","affiliation":[{"name":"Department of Music Technology and Acoustics, Hellenic Mediterranean University, 71410 Heraklion, Greece"}]}],"member":"1968","published-online":{"date-parts":[[2026,1,1]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"101716","DOI":"10.1016\/j.iot.2025.101716","article-title":"From lab to field: Real-world evaluation of an AI-driven Smart Video Solution to enhance community safety","volume":"33","author":"Yao","year":"2025","journal-title":"Internet Things"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Ariel, B., Bland, M., and Sutherland, A. (2017). Lowering the threshold of effective deterrence\u2019-Testing the effect of private security agents in public spaces on crime: A randomized controlled trial in a mass transit system. PLoS ONE, 12.","DOI":"10.1371\/journal.pone.0187392"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"98","DOI":"10.1016\/j.drugalcdep.2009.12.002","article-title":"The cost of crime to society: New crime-specific estimates for policy and program evaluation","volume":"108","author":"McCollister","year":"2010","journal-title":"Drug Alcohol Depend."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Wu, P., Liu, J., Shi, Y., Sun, Y., Shao, F., Wu, Z., and Yang, Z. (2020). Not only look, but also listen: Learning multimodal violence detection under weak supervision. Computer Vision\u2014ECCV 2020, Springer International Publishing.","DOI":"10.1007\/978-3-030-58577-8_20"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Pang, W.-F., He, Q.-H., Hu, Y.-J., and Li, Y.-X. (2021, January 6\u201311). Violence detection in videos based on fusing visual and audio information. Proceedings of the ICASSP 2021, Toronto, ON, Canada.","DOI":"10.1109\/ICASSP39728.2021.9413686"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"4922","DOI":"10.1109\/TMM.2022.3184533","article-title":"Audiovisual dependency attention for violence detection in videos","volume":"25","author":"Pang","year":"2023","journal-title":"IEEE Trans. Multimed."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"1674","DOI":"10.1109\/TMM.2022.3147369","article-title":"Weakly Supervised Audio-Visual Violence Detection","volume":"25","author":"Wu","year":"2023","journal-title":"IEEE Trans. Multimed."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"105286","DOI":"10.1016\/j.imavis.2024.105286","article-title":"Learning weakly supervised audio-visual violence detection in hyperbolic space","volume":"151","author":"Zhou","year":"2024","journal-title":"Image Vis. Comput."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Xiao, Y., Gao, G., Wang, L., and Lai, H. (2022). Optical Flow-Aware-Based Multi-Modal Fusion Network for Violence Detection. Entropy, 24.","DOI":"10.3390\/e24070939"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Negre, P., Alonso, R.S., Gonz\u00e1lez-Briones, A., Prieto, J., and Rodr\u00edguez-Gonz\u00e1lez, S. (2024). Literature Review of Deep-Learning-Based Detection of Violence in Video. Sensors, 24.","DOI":"10.3390\/s24124016"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Vijeikis, R., Raudonis, V., and Dervinis, G. (2022). Efficient Violence Detection in Surveillance. Sensors, 22.","DOI":"10.3390\/s22062216"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"127489","DOI":"10.1016\/j.neucom.2024.127489","article-title":"Audio\u2013visual representation learning for anomaly events detection in crowds","volume":"582","author":"Gao","year":"2024","journal-title":"Neurocomputing"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Wu, P., Zhou, X., Pang, G., Zhou, L., Yan, Q., Wang, P., and Zhang, Y. (2024, January 26\u201327). VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection. Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada.","DOI":"10.1609\/aaai.v38i6.28423"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"244","DOI":"10.1016\/j.procs.2023.08.162","article-title":"Violence detection in real-life audio signals using lightweight deep neural networks","volume":"222","author":"Bakhshi","year":"2023","journal-title":"Procedia Comput. Sci."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"109638","DOI":"10.1016\/j.apacoust.2023.109638","article-title":"Computationally constrained audio-based violence detection through transfer learning and data augmentation techniques","volume":"213","year":"2023","journal-title":"Appl. Acoust."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Gong, Y., Chung, Y.-A., and Glass, J. (September, January 30). AST: Audio spectrogram transformer. Proceedings of the Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czech Republic.","DOI":"10.21437\/Interspeech.2021-698"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Chen, K., Du, X., Zhu, B., Ma, Z., Berg-Kirkpatrick, T., and Dubnov, S. (2022, January 23\u201327). HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection. Proceedings of the ICASSP 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore.","DOI":"10.1109\/ICASSP43922.2022.9746312"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Dinkel, H., Yan, Z., Wang, Y.-Q., Zhang, J., Wang, Y.-J., and Wang, B. (2024). Streaming Audio Transformers for Online Audio Tagging. Proc. Interspeech, 1145\u20131149.","DOI":"10.21437\/Interspeech.2024-242"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"3292","DOI":"10.1109\/TASLP.2021.3120633","article-title":"PSLA: Improving audio tagging with pretraining, sampling, labeling, and aggregation","volume":"29","author":"Gong","year":"2021","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"2880","DOI":"10.1109\/TASLP.2020.3030497","article-title":"PANNs: Large-scale pretrained audio neural networks for audio pattern recognition","volume":"28","author":"Kong","year":"2020","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_21","unstructured":"Elizalde, B., Deshmukh, S., Al Ismail, M., and Wang, H. (2024, January 14\u201319). Natural language supervision for general-purpose audio representations. Proceedings of the IEEE ICASSP, Seoul, Republic of Korea."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Wu, H.-H., Seetharaman, P., Kumar, K., and Bello, J.P. (2022, January 23\u201327). Wav2CLIP: Learning robust audio representations from CLIP. Proceedings of the ICASSP, Singapore.","DOI":"10.31219\/osf.io\/r2vwf"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"13300","DOI":"10.1038\/s41598-022-17497-1","article-title":"Gun identification from gunshot audios for secure public places using transformer learning","volume":"12","author":"Nijhawan","year":"2022","journal-title":"Sci. Rep."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Singh, R.B., Zhuang, H., and Pawani, J.K. (2021). Data Collection, Modeling, and Classification for Gunshot and Gunshot-like Audio Events: A Case Study. Sensors, 21.","DOI":"10.3390\/s21217320"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"625","DOI":"10.1016\/j.scijus.2024.09.007","article-title":"Gunshots detection, identification, and classification: Applications to forensic science","volume":"64","author":"Teng","year":"2024","journal-title":"Sci. Justice"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"134230","DOI":"10.1109\/ACCESS.2022.3231681","article-title":"Sound event detection for human safety and security in noisy environments","volume":"10","author":"Neri","year":"2022","journal-title":"IEEE Access"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Schewski, L., Doss, M.M., Beldi, G., and Keller, S. (2025). Measuring negative emotions and stress through acoustic correlates in speech: A systematic review. PLoS ONE, 20.","DOI":"10.1371\/journal.pone.0328833"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Opladen, V., Tanck, J.A., Baur, J., Hartmann, A.S., Svaldi, J., and Vocks, S. (2023). Body exposure and vocal analysis: Validation of fundamental frequency as a correlate of emotional arousal and valence. Front. Psychiatry, 14.","DOI":"10.3389\/fpsyt.2023.1087548"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Zhang, J., Yin, H., Zhang, J., Yang, G., Qin, J., and He, L. (2022). Real-time mental stress detection using multimodality expressions with a deep learning framework. Front. Neurosci., 16.","DOI":"10.3389\/fnins.2022.947168"},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"1694","DOI":"10.1080\/00140139.2024.2430370","article-title":"Daily stress detection from real-life speeches using acoustic and semantic information","volume":"68","author":"Lu","year":"2025","journal-title":"Ergonomics"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Baird, A., Triantafyllopoulos, A., Z\u00e4nkert, S., Ottl, S., Christ, L., Stappen, L., Konzok, J., Sturmbauer, S., Me\u00dfner, E.-M., and Kudielka, B.M. (2021). An evaluation of speech-based recognition of emotional state and arousal under stress. Front. Comput. Sci., 3.","DOI":"10.3389\/fcomp.2021.750284"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Fr\u00fchholz, S., Dietziker, J., Staib, M., and Trost, W. (2021). Neurocognitive processing efficiency for discriminating human non-alarm rather than alarm scream calls. PLoS Biol., 19.","DOI":"10.1371\/journal.pbio.3000751"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Allouch, M., Mansbach, N., Azaria, A., and Azoulay, R. (2023). Utilizing Machine Learning for Detecting Harmful Situations by Audio and Text. Appl. Sci., 13.","DOI":"10.3390\/app13063927"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Saradopoulos, I., Potamitis, I., Ntalampiras, S., Rigakis, I., Manifavas, C., and Konstantaras, A. (2025). Real-Time Acoustic Detection of Critical Incidents in Smart Cities Using Artificial Intelligence and Edge Networks. Sensors, 25.","DOI":"10.3390\/s25082597"},{"key":"ref_35","unstructured":"(2025, November 28). X-Violence Dataset. Available online: https:\/\/huggingface.co\/datasets\/jherng\/xd-violence\/tree\/main\/data\/video."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Elizalde, B., Deshmukh, S., Ismail, M.A., and Wang, H. (2023, January 4\u201310). CLAP Learning Audio Concepts from Natural Language Supervision. Proceedings of the ICASSP 2023\u20142023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece.","DOI":"10.1109\/ICASSP49357.2023.10095889"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Wu, Y., Chen, K., Zhang, T., Hui, Y., Berg-Kirkpatrick, T., and Dubnov, S. (2023, January 4\u201310). Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation. Proceedings of the ICASSP 2023\u20142023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece.","DOI":"10.1109\/ICASSP49357.2023.10095969"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Kim, J.W., Salamon, J., Li, P., and Bello, J.P. (2018, January 15\u201320). Crepe: A Convolutional Representation for Pitch Estimation. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8461329"},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"1917","DOI":"10.1121\/1.1458024","article-title":"YIN, a fundamental frequency estimator for speech and music","volume":"111","author":"Kawahara","year":"2002","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Mauch, M., and Dixon, S. (2014, January 4\u20139). PYIN: A fundamental frequency estimator using probabilistic threshold distributions. Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy.","DOI":"10.1109\/ICASSP.2014.6853678"},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"227","DOI":"10.1016\/S0167-6393(02)00084-5","article-title":"Vocal communication of emotion: A review of research paradigms","volume":"40","author":"Scherer","year":"2003","journal-title":"Speech Commun."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"526","DOI":"10.1109\/TIFS.2019.2925452","article-title":"Selective Audio Adversarial Example in Evasion Attack on Speech Recognition System","volume":"15","author":"Kwon","year":"2020","journal-title":"IEEE Trans. Inf. Forensics Secur."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"122464","DOI":"10.1109\/ACCESS.2022.3216075","article-title":"Audio Adversarial Example Detection Using the Audio Style Transfer Learning Method","volume":"13","author":"Kwon","year":"2025","journal-title":"IEEE Access"},{"key":"ref_44","unstructured":"Harari, Y.N. (2014). Sapiens: A Brief History of Humankind, Harvill Secker."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Tian, Y., Pang, G., Chen, Y., Singh, R., Verjans, J.W., and Carneiro, G. (2021, January 10\u201317). Weakly-supervised Video Anomaly Detection with Robust Temporal Feature Magnitude Learning. Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00493"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Chen, Y., Liu, Z., Zhang, B., Fok, W., Qi, X., and Wu, Y.-C. (2023, January 7\u201314). MGFN: Magnitude-Contrastive Glance-and-Focus Network for Weakly-Supervised Video Anomaly Detection. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Washington, DC, USA.","DOI":"10.1609\/aaai.v37i1.25112"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Yu, J., Liu, J., Cheng, Y., Feng, R., and Zhang, Y. (2022, January 10\u201314). Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection. Proceedings of the 30th ACM International Conference on Multimedia (ACM MM), Lisboa, Portugal.","DOI":"10.1145\/3503161.3547868"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Li, M., Sang, J., Lu, Y., and Du, L. (2025). WSVAD-CLIP: Temporally Aware and Prompt Learning with CLIP for Weakly Supervised Video Anomaly Detection. J. Imaging, 11.","DOI":"10.3390\/jimaging11100354"},{"key":"ref_49","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/2871183","article-title":"Audio Surveillance: A Systematic Review","volume":"48","author":"Crocco","year":"2016","journal-title":"ACM Comput. Surv."}],"container-title":["Journal of Sensor and Actuator Networks"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2224-2708\/15\/1\/6\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,2]],"date-time":"2026-01-02T12:42:00Z","timestamp":1767357720000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2224-2708\/15\/1\/6"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1,1]]},"references-count":49,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2026,2]]}},"alternative-id":["jsan15010006"],"URL":"https:\/\/doi.org\/10.3390\/jsan15010006","relation":{},"ISSN":["2224-2708"],"issn-type":[{"value":"2224-2708","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,1,1]]}}}