{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,1]],"date-time":"2026-02-01T04:16:46Z","timestamp":1769919406050,"version":"3.49.0"},"reference-count":60,"publisher":"MDPI AG","issue":"7","license":[{"start":{"date-parts":[[2022,6,22]],"date-time":"2022-06-22T00:00:00Z","timestamp":1655856000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100000780","name":"the European Commission funded project Humane AI: Toward AI Systems That Augment and Empower Humans by Understanding Us, our Society and the World Around Us","doi-asserted-by":"publisher","award":["820437"],"award-info":[{"award-number":["820437"]}],"id":[{"id":"10.13039\/501100000780","id-type":"DOI","asserted-by":"publisher"}]},{"name":"the \u201cApplication Domain Specific Highly Reliable IT Solutions\u201d project","award":["820437"],"award-info":[{"award-number":["820437"]}]},{"name":"the Ministry of Innovation and Technology NRDI Office","award":["820437"],"award-info":[{"award-number":["820437"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["MTI"],"abstract":"<jats:p>We consider, evaluate, and develop methods for home rehabilitation scenarios. We show the required modules for this scenario. Due to the large number of modules, the framework falls into the category of Composite AI. Our work is based on collected videos with high-quality execution and samples of typical errors. They are augmented by sample dialogues about the exercise to be executed and the assumed errors. We study and discuss body pose estimation technology, dialogue systems of different kinds and the emerging constraints of verbal communication. We demonstrate that the optimization of the camera and the body pose allows high-precision recording and requires the following components: (1) optimization needs a 3D representation of the environment, (2) a navigation dialogue to guide the patient to the optimal pose, (3) semantic and instance maps are necessary for verbal instructions about the navigation. We put forth different communication methods, from video-based presentation to chit-chat-like dialogues through rule-based methods. We discuss the methods for different aspects of the challenges that can improve the performance of the individual components. Due to the emerging solutions, we claim that the range of applications will drastically grow in the very near future.<\/jats:p>","DOI":"10.3390\/mti6070048","type":"journal-article","created":{"date-parts":[[2022,6,22]],"date-time":"2022-06-22T23:11:19Z","timestamp":1655939479000},"page":"48","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":15,"title":["AI Technologies for Machine Supervision and Help in a Rehabilitation Scenario"],"prefix":"10.3390","volume":"6","author":[{"given":"G\u00e1bor","family":"Baranyi","sequence":"first","affiliation":[{"name":"Department of Artificial Intelligence, Faculty of Informatics, E\u00f6tv\u00f6s Lor\u00e1nd University, 1083 Budapest, Hungary"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2839-8992","authenticated-orcid":false,"given":"Bruno Carlos","family":"Dos Santos Mel\u00edcio","sequence":"additional","affiliation":[{"name":"Department of Artificial Intelligence, Faculty of Informatics, E\u00f6tv\u00f6s Lor\u00e1nd University, 1083 Budapest, Hungary"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zs\u00f3fia","family":"Ga\u00e1l","sequence":"additional","affiliation":[{"name":"Department of Artificial Intelligence, Faculty of Informatics, E\u00f6tv\u00f6s Lor\u00e1nd University, 1083 Budapest, Hungary"},{"name":"Emineo Private Hospital, 1016 Budapest, Hungary"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9716-9176","authenticated-orcid":false,"given":"Levente","family":"Hajder","sequence":"additional","affiliation":[{"name":"Department of Artificial Intelligence, Faculty of Informatics, E\u00f6tv\u00f6s Lor\u00e1nd University, 1083 Budapest, Hungary"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4303-4054","authenticated-orcid":false,"given":"Andr\u00e1s","family":"Simonyi","sequence":"additional","affiliation":[{"name":"Department of Artificial Intelligence, Faculty of Informatics, E\u00f6tv\u00f6s Lor\u00e1nd University, 1083 Budapest, Hungary"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"D\u00e1niel","family":"Sindely","sequence":"additional","affiliation":[{"name":"Department of Artificial Intelligence, Faculty of Informatics, E\u00f6tv\u00f6s Lor\u00e1nd University, 1083 Budapest, Hungary"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Joul","family":"Skaf","sequence":"additional","affiliation":[{"name":"Department of Artificial Intelligence, Faculty of Informatics, E\u00f6tv\u00f6s Lor\u00e1nd University, 1083 Budapest, Hungary"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1415-1702","authenticated-orcid":false,"given":"Ond\u0159ej","family":"Du\u0161ek","sequence":"additional","affiliation":[{"name":"Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, 118 00 Prague, Czech Republic"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tom\u00e1\u0161","family":"Nekvinda","sequence":"additional","affiliation":[{"name":"Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, 118 00 Prague, Czech Republic"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1280-3447","authenticated-orcid":false,"given":"Andr\u00e1s","family":"L\u0151rincz","sequence":"additional","affiliation":[{"name":"Department of Artificial Intelligence, Faculty of Informatics, E\u00f6tv\u00f6s Lor\u00e1nd University, 1083 Budapest, Hungary"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2022,6,22]]},"reference":[{"key":"ref_1","unstructured":"Gartner Group (2022, May 22). 5 Trends Drive the Gartner Hype Cycle for Emerging Technologies. Available online: https:\/\/www.gartner.com\/smarterwithgartner\/5-trends-drive-the-gartner-hype-cycle-for-emerging-technologies-2020."},{"key":"ref_2","unstructured":"iHealthcareAnalyst, Inc (2022, May 22). Global Home Rehabilitation Market $225 Billion by 2027. Available online: https:\/\/bit.ly\/3Ox9WOm."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Der Loos, V., Machiel, H., Reinkensmeyer, D.J., and Guglielmelli, E. (2016). Rehabilitation and health care robotics. Springer Handbook of Robotics, Springer.","DOI":"10.1007\/978-3-319-32552-1_64"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"181","DOI":"10.3389\/frobt.2021.612331","article-title":"Robotic home-based rehabilitation systems design: From a literature review to a conceptual framework for community-based remote therapy during COVID-19 pandemic","volume":"8","author":"Akbari","year":"2021","journal-title":"Front. Robot. AI"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Yedidsion, H., Deans, J., Sheehan, C., Chillara, M., Hart, J., Stone, P., and Mooney, R.J. (2019). Optimal use of verbal instructions for multi-robot human navigation guidance. International Conference on Social Robotics, Springer.","DOI":"10.1007\/978-3-030-35888-4_13"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"258","DOI":"10.1016\/j.cogsys.2018.10.032","article-title":"Robot-enabled support of daily activities in smart home environments","volume":"54","author":"Wilson","year":"2019","journal-title":"Cogn. Syst. Res."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"487","DOI":"10.1093\/geront\/gnaa163","article-title":"Retooling the health care workforce for an aging America: A current perspective","volume":"61","author":"Foley","year":"2021","journal-title":"Gerontol."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"103833","DOI":"10.1016\/j.robot.2021.103833","article-title":"A systematic mapping study of robotics in human care","volume":"144","author":"Santos","year":"2021","journal-title":"Robot. Auton. Syst."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"173","DOI":"10.1016\/j.healthpol.2021.09.010","article-title":"Exploration of current challenges in rehabilitation from the perspective of healthcare professionals: Switzerland as a case in point","volume":"126","author":"Spiess","year":"2022","journal-title":"Health Policy"},{"key":"ref_10","unstructured":"Byron, D., Koller, A., Oberlander, J., Stoia, L., and Striegnitz, K. (2007, January 20\u201321). Generating instructions in virtual environments (GIVE): A challenge and an evaluation testbed for NLG. Proceedings of the Workshop on Shared Tasks and Comparative Evaluation in Natural Language Generation, Arlington, VA, USA."},{"key":"ref_11","unstructured":"Anderson, P., Chang, A., Chaplot, D.S., Dosovitskiy, A., Gupta, S., Koltun, V., Kosecka, J., Malik, J., Mottaghi, R., and Savva, M. (2018). On evaluation of embodied navigation agents. arXiv."},{"key":"ref_12","unstructured":"Puig, X., Shu, T., Li, S., Wang, Z., Liao, Y.H., Tenenbaum, J.B., Fidler, S., and Torralba, A. (2021, January 3\u20137). Watch-And-Help: A challenge for social perception and human-AI collaboration. Proceedings of the International Conference on Learning Representations, Virtual."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Rozenberszki, D., S\u00f6r\u00f6s, G., Szeier, S., and Lorincz, A. (2021, January 11\u201317). 3D Semantic Label Transfer in Human-Robot Collaboration. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, BC, Canada.","DOI":"10.1109\/ICCVW54120.2021.00294"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Cao, Z., Simon, T., Wei, S.E., and Sheikh, Y. (2017, January 22\u201329). Realtime multi-person 2D pose estimation using part affinity fields. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Venice, Italy.","DOI":"10.1109\/CVPR.2017.143"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Sun, K., Xiao, B., Liu, D., and Wang, J. (2019, January 15\u201320). Deep high-resolution representation learning for human pose estimation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00584"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"102897","DOI":"10.1016\/j.cviu.2019.102897","article-title":"Monocular human pose estimation: A survey of deep learning-based methods","volume":"192","author":"Chen","year":"2020","journal-title":"Comput. Vis. Image Underst."},{"key":"ref_17","unstructured":"Bazarevsky, V., Grishchenko, I., Raveendran, K., Zhu, T., Zhang, F., and Grundmann, M. (2020). BlazePose: On-device Real-time Body Pose tracking. arXiv."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/2816795.2818013","article-title":"SMPL: A skinned multi-person linear model","volume":"34","author":"Loper","year":"2015","journal-title":"ACM Trans. Graph."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A., Tzionas, D., and Black, M.J. (2019, January 15\u201320). Expressive body capture: 3D hands, face, and body from a single image. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.01123"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Choutas, V., Pavlakos, G., Bolkart, T., Tzionas, D., and Black, M.J. (2020). Monocular expressive body regression through body-driven attention. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-030-58607-2_2"},{"key":"ref_21","first-page":"16","article-title":"MeTRAbs: Metric-Scale Truncation-Robust Heatmaps for Absolute 3D Human Pose Estimation","volume":"3","author":"Linder","year":"2020","journal-title":"IEEE Trans. Biom. Behav. Identity Sci."},{"key":"ref_22","unstructured":"Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., Zhang, F., Chang, C.L., Yong, M.G., and Lee, J. (2019). Mediapipe: A framework for building perception pipelines. arXiv."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Mangal, N.K., and Tiwari, A.K. (2021). A Review of the Evolution of Scientific Literature on Technology-assisted Approaches using RGB-D sensors for Musculoskeletal Health Monitoring. Computers in Biology and Medicine, Elsevier.","DOI":"10.1016\/j.compbiomed.2021.104316"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"1037","DOI":"10.1001\/jama.2017.1224","article-title":"Effect of inpatient rehabilitation vs a monitored home-based program on mobility in patients with total knee arthroplasty: The HIHO randomized clinical trial","volume":"317","author":"Buhagiar","year":"2017","journal-title":"JAMA"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"468","DOI":"10.1109\/TNSRE.2020.2966249","article-title":"A Deep Learning Framework for Assessing Physical Rehabilitation Exercises","volume":"28","author":"Liao","year":"2020","journal-title":"IEEE Trans. Neural Syst. Rehabil. Eng."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Boyer, P., Burns, D., and Whyne, C. (2021). Out-of-Distribution Detection of Human Activity Recognition with Smartwatch Inertial Sensors. Sensors, 21.","DOI":"10.3390\/s21051669"},{"key":"ref_27","unstructured":"Muoio, D. (2022, May 22). Hinge Health Now Valued at $3B Following $300M Series D. Available online: https:\/\/www.mobihealthnews.com\/news\/hinge-health-now-valued-3b-following-300m-series-d."},{"key":"ref_28","unstructured":"Savva, M., Kadian, A., Maksymets, O., Zhao, Y., Wijmans, E., Jain, B., Straub, J., Liu, J., Koltun, V., and Malik, J. (November, January 27). Habitat: A platform for embodied AI research. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Korea."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Sun, S., Galley, M., Chen, Y.C., Brockett, C., Gao, X., Gao, J., Liu, J., and Dolan, B. (2020, January 5\u201310). DIALOGPT: Large-Scale Generative Pre-training for Conversational Response Generation. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Online.","DOI":"10.18653\/v1\/2020.acl-demos.30"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"M\u00fcller, M., and Koltun, V. (June, January 30). Openbot: Turning smartphones into robots. Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi\u2019an, China.","DOI":"10.1109\/ICRA48506.2021.9561788"},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"znab134.562","DOI":"10.1093\/bjs\/znab134.562","article-title":"542 The Attune Total Knee Replacement: Early Clinical Performance Versus an Established Implant At 3 Years Post-Surgery","volume":"108","author":"Gunn","year":"2021","journal-title":"Br. J. Surg."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Byrne, B., Krishnamoorthi, K., Sankar, C., Neelakantan, A., Goodrich, B., Duckworth, D., Yavuz, S., Dubey, A., Kim, K.Y., and Cedilnik, A. (2019, January 3\u20137). Taskmaster-1: Toward a Realistic and Diverse Dialog Dataset. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China.","DOI":"10.18653\/v1\/D19-1459"},{"key":"ref_33","unstructured":"Mosig, J.E.M., Mehri, S., and Kober, T. (2020). STAR: A Schema-Guided Dialog Dataset for Transfer Learning. arXiv."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Rosinol, A., Abate, M., Chang, Y., and Carlone, L. (2020, January 23\u201327). Kimera: An open-source library for real-time metric-semantic localization and mapping. Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA.","DOI":"10.1109\/ICRA40945.2020.9196885"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"He, K., Gkioxari, G., Doll\u00e1r, P., and Girshick, R.B. (2017). Mask R-CNN. arXiv.","DOI":"10.1109\/ICCV.2017.322"},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"107193","DOI":"10.1016\/j.patcog.2019.107193","article-title":"UcoSLAM: Simultaneous localization and mapping by fusion of keypoints and squared planar markers","volume":"101","year":"2020","journal-title":"Pattern Recognit."},{"key":"ref_37","unstructured":"Straub, J., Whelan, T., Ma, L., Chen, Y., Wijmans, E., Green, S., Engel, J.J., Mur-Artal, R., Ren, C., and Verma, S. (2019). The Replica Dataset: A Digital Replica of Indoor Spaces. arXiv."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Gasparetto, A., Boscariol, P., Lanzutti, A., and Vidoni, R. (2015). Path planning and trajectory planning algorithms: A general overview. Motion and Operation Planning of Robotic Systems, Springer.","DOI":"10.1007\/978-3-319-14705-5_1"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Hartley, R., and Zisserman, A. (2003). Multiple View Geometry in Computer Vision, Cambridge University Press.","DOI":"10.1017\/CBO9780511811685"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Ito, Y. (2015). Delaunay Triangulation. Encyclopedia of Applied and Computational Mathematics, Springer.","DOI":"10.1007\/978-3-540-70529-1_314"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., S\u00fcnderhauf, N., Reid, I.D., Gould, S., and van den Hengel, A. (2017). Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments. arXiv.","DOI":"10.1109\/CVPR.2018.00387"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"McTear, M. (2020). Conversational AI: Dialogue Systems, Conversational Agents, and Chatbots, Morgan & Claypool Publishers.","DOI":"10.1007\/978-3-031-02176-3"},{"key":"ref_43","unstructured":"Yogatama, D., Dyer, C., Ling, W., and Blunsom, P. (2017). Generative and Discriminative Text Classification with Recurrent Neural Networks. arXiv."},{"key":"ref_44","unstructured":"Ng, A.Y., and Jordan, M.I. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems, MIT Press."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Shalyminov, I., Sordoni, A., Atkinson, A., and Schulz, H. (2020). Hybrid Generative-Retrieval Transformers for Dialogue Domain Adaptation. arXiv.","DOI":"10.1109\/TASLP.2021.3074779"},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"109","DOI":"10.1016\/S0079-7421(08)60536-8","article-title":"Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem","volume":"24","author":"McCloskey","year":"1989","journal-title":"Psychol. Learn. Motiv. Adv. Res. Theory"},{"key":"ref_47","unstructured":"Li, Y., Su, H., Shen, X., Li, W., Cao, Z., and Niu, S. (December, January 27). DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset. Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Taipei, Taiwan."},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Sennrich, R., Haddow, B., and Birch, A. (2016, January 7\u201312). Improving Neural Machine Translation Models with Monolingual Data. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), Berlin, Germany.","DOI":"10.18653\/v1\/P16-1009"},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Edunov, S., Ott, M., Auli, M., and Grangier, D. (November, January 31). Understanding Back-Translation at Scale. Proceedings of the 2018 EMNLP, Brussels, Belgium.","DOI":"10.18653\/v1\/D18-1045"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Mousavian, A., Toshev, A., Fi\u0161er, M., Ko\u0161eck\u00e1, J., Wahid, A., and Davidson, J. (2019, January 20\u201324). Visual representations for semantic target driven navigation. Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada.","DOI":"10.1109\/ICRA.2019.8793493"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Papineni, K., Roukos, S., Ward, T., and Zhu, W.J. (2002, January 7\u201312). Bleu: A Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA.","DOI":"10.3115\/1073083.1073135"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Liu, C.W., Lowe, R., Serban, I., Noseworthy, M., Charlin, L., and Pineau, J. (2016, January 1\u20134). How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA.","DOI":"10.18653\/v1\/D16-1230"},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Dabhi, M., Wang, C., Saluja, K., Jeni, L.A., Fasel, I., and Lucey, S. (2021, January 1\u20133). High Fidelity 3D Reconstructions with Limited Physical Views. Proceedings of the 2021 International Conference on 3D Vision (3DV), Virtual.","DOI":"10.1109\/3DV53792.2021.00137"},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Zhan, Y., Li, F., Weng, R., and Choi, W. (2022). Ray3D: Ray-based 3D human pose estimation for monocular absolute 3D localization. arXiv.","DOI":"10.1109\/CVPR52688.2022.01277"},{"key":"ref_55","unstructured":"Gunasekara, C., Kim, S., D\u2019Haro, L.F., Rastogi, A., Chen, Y.N., Eric, M., Hedayatnia, B., Gopalakrishnan, K., Liu, Y., and Huang, C.W. (2020). Overview of the Ninth Dialog System Technology Challenge: DSTC9. arXiv."},{"key":"ref_56","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv."},{"key":"ref_57","first-page":"1877","article-title":"Language models are few-shot learners","volume":"33","author":"Brown","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_58","doi-asserted-by":"crossref","unstructured":"Gabbay, A., Shamir, A., and Peleg, S. (2018, January 2\u20136). Visual Speech Enhancement. Proceedings of the Interspeech 2018, Hyderabad, India.","DOI":"10.21437\/Interspeech.2018-1955"},{"key":"ref_59","doi-asserted-by":"crossref","unstructured":"Gao, R., and Grauman, K. (2021, January 20\u201325). VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency. Proceedings of the 2021 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.01524"},{"key":"ref_60","doi-asserted-by":"crossref","unstructured":"Montesinos, J.F., Kadandale, V.S., and Haro, G. (2022). VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer. arXiv.","DOI":"10.1007\/978-3-031-19836-6_18"}],"container-title":["Multimodal Technologies and Interaction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2414-4088\/6\/7\/48\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T23:37:46Z","timestamp":1760139466000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2414-4088\/6\/7\/48"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,6,22]]},"references-count":60,"journal-issue":{"issue":"7","published-online":{"date-parts":[[2022,7]]}},"alternative-id":["mti6070048"],"URL":"https:\/\/doi.org\/10.3390\/mti6070048","relation":{},"ISSN":["2414-4088"],"issn-type":[{"value":"2414-4088","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,6,22]]}}}