{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,28]],"date-time":"2026-03-28T10:38:25Z","timestamp":1774694305761,"version":"3.50.1"},"reference-count":50,"publisher":"MDPI AG","issue":"10","license":[{"start":{"date-parts":[[2021,5,16]],"date-time":"2021-05-16T00:00:00Z","timestamp":1621123200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>We present SpeakingFaces as a publicly-available large-scale multimodal dataset developed to support machine learning research in contexts that utilize a combination of thermal, visual, and audio data streams; examples include human\u2013computer interaction, biometric authentication, recognition systems, domain transfer, and speech recognition. SpeakingFaces is comprised of aligned high-resolution thermal and visual spectra image streams of fully-framed faces synchronized with audio recordings of each subject speaking approximately 100 imperative phrases. Data were collected from 142 subjects, yielding over 13,000 instances of synchronized data (\u223c3.8 TB). For technical validation, we demonstrate two baseline examples. The first baseline shows classification by gender, utilizing different combinations of the three data streams in both clean and noisy environments. The second example consists of thermal-to-visual facial image translation, as an instance of domain transfer.<\/jats:p>","DOI":"10.3390\/s21103465","type":"journal-article","created":{"date-parts":[[2021,5,17]],"date-time":"2021-05-17T02:31:34Z","timestamp":1621218694000},"page":"3465","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":66,"title":["SpeakingFaces: A Large-Scale Multimodal Dataset of Voice Commands with Visual and Thermal Video Streams"],"prefix":"10.3390","volume":"21","author":[{"given":"Madina","family":"Abdrakhmanova","sequence":"first","affiliation":[{"name":"Institute of Smart Systems and Artificial Intelligence, Nazarbayev University, Nur-Sultan 010000, Kazakhstan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6169-8252","authenticated-orcid":false,"given":"Askat","family":"Kuzdeuov","sequence":"additional","affiliation":[{"name":"Institute of Smart Systems and Artificial Intelligence, Nazarbayev University, Nur-Sultan 010000, Kazakhstan"}]},{"given":"Sheikh","family":"Jarju","sequence":"additional","affiliation":[{"name":"Institute of Smart Systems and Artificial Intelligence, Nazarbayev University, Nur-Sultan 010000, Kazakhstan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9422-6833","authenticated-orcid":false,"given":"Yerbolat","family":"Khassanov","sequence":"additional","affiliation":[{"name":"Institute of Smart Systems and Artificial Intelligence, Nazarbayev University, Nur-Sultan 010000, Kazakhstan"}]},{"given":"Michael","family":"Lewis","sequence":"additional","affiliation":[{"name":"School of Engineering and Digital Sciences, Nazarbayev University, Nur-Sultan 010000, Kazakhstan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4042-425X","authenticated-orcid":false,"given":"Huseyin Atakan","family":"Varol","sequence":"additional","affiliation":[{"name":"Institute of Smart Systems and Artificial Intelligence, Nazarbayev University, Nur-Sultan 010000, Kazakhstan"},{"name":"School of Engineering and Digital Sciences, Nazarbayev University, Nur-Sultan 010000, Kazakhstan"}]}],"member":"1968","published-online":{"date-parts":[[2021,5,16]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Chen, Z., Wang, S., and Qian, Y. (2020, January 25\u201329). Multi-modality Matters: A Performance Leap on VoxCeleb. Proceedings of the Interspeech, Shanghai, China.","DOI":"10.21437\/Interspeech.2020-2229"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"245","DOI":"10.1007\/s00138-013-0570-5","article-title":"Thermal cameras and applications: A survey","volume":"25","author":"Gade","year":"2014","journal-title":"Mach. Vis. Appl."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Shon, S., Oh, T., and Glass, J. (2019, January 12\u201317). Noise-tolerant Audio-visual Online Person Verification Using an Attention-based Neural Network Fusion. Proceedings of the ICASSP 2019\u20142019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK.","DOI":"10.1109\/ICASSP.2019.8683477"},{"key":"ref_4","unstructured":"Afouras, T., Chung, J.S., Senior, A., Vinyals, O., and Zisserman, A. (2018). Deep audio-visual speech recognition. IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1109\/TMM.2020.2975922","article-title":"End-to-End Audiovisual Speech Recognition System With Multitask Learning","volume":"23","author":"Tao","year":"2021","journal-title":"IEEE Trans. Multim."},{"key":"ref_6","unstructured":"(2021, February 24). FLIR ONE Pro. Available online: https:\/\/www.flir.com\/products\/flir-one-pro\/."},{"key":"ref_7","unstructured":"(2021, February 24). CAT S62 Pro. Available online: https:\/\/www.catphones.com\/en-dk\/cat-s62-pro-smartphone\/."},{"key":"ref_8","unstructured":"(2021, February 24). Lepton\u2014LWIR Micro Thermal Camera Module. Available online: https:\/\/www.flir.com\/products\/lepton\/."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"119","DOI":"10.1007\/s12559-012-9163-2","article-title":"A New Face Database Simultaneously Acquired in Visible, Near-Infrared and Thermal Spectrums","volume":"5","author":"Mekyska","year":"2013","journal-title":"Cogn. Comput."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Mallat, K., and Dugelay, J.L. (2018, January 26\u201328). A benchmark database of visible and thermal paired face images across multiple variations. Proceedings of the International Conference of the Biometrics Special Interest Group, BIOSIG 2018, Darmstadt, Germany.","DOI":"10.23919\/BIOSIG.2018.8553431"},{"key":"ref_11","unstructured":"Hammoud, R.I. (2020, January 20). IEEE OTCBVS WS Series Bench. Available online: http:\/\/vcipl-okstate.org\/pbvs\/bench\/."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"682","DOI":"10.1109\/TMM.2010.2060716","article-title":"A Natural Visible and Infrared Facial Expression Database for Expression Recognition and Emotion Inference","volume":"12","author":"Wang","year":"2010","journal-title":"IEEE Trans. Multimed."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"509","DOI":"10.1109\/TPAMI.2018.2884458","article-title":"A comprehensive database for benchmarking imaging systems","volume":"42","author":"Panetta","year":"2018","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_14","unstructured":"Ghiass, R.S., Bendada, H., and Maldague, X. (2018). Universit\u00e9 Laval Face Motion and Time-Lapse Video Database (UL-FMTV), Universit\u00e9 Laval. Technical Report."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Poster, D., Thielke, M., Nguyen, R., Rajaraman, S., Di, X., Fondje, C.N., Patel, V.M., Short, N.J., Riggan, B.S., and Nasrabadi, N.M. (2021, January 5\u20139). A Large-Scale, Time-Synchronized Visible and Thermal Face Dataset. Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA.","DOI":"10.1109\/WACV48630.2021.00160"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"2421","DOI":"10.1121\/1.2229005","article-title":"An audio-visual corpus for speech perception and automatic speech recognition","volume":"120","author":"Cooke","year":"2006","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_17","unstructured":"Chung, J.S., and Zisserman, A. (2016, January 20\u201324). Lip Reading in the Wild. Proceedings of the Asian Conference on Computer Vision, Taipei, Taiwan."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Chung, J.S., Senior, A., Vinyals, O., and Zisserman, A. (2017, January 21\u201326). Lip reading sentences in the wild. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.367"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Campagna, G., Ramesh, R., Xu, S., Fischer, M., and Lam, M.S. (2017, January 3\u20137). Almond: The architecture of an open, crowdsourced, privacy-preserving, programmable virtual assistant. Proceedings of the International Conference on World Wide Web, Perth, Australia.","DOI":"10.1145\/3038912.3052562"},{"key":"ref_20","unstructured":"(2019, December 20). The Best Siri Commands for iOS and MacOS. Available online: https:\/\/www.digitaltrends.com\/mobile\/best-siri-commands\/."},{"key":"ref_21","unstructured":"(2019, December 20). The Complete List of Siri Commands. Available online: https:\/\/www.cnet.com\/how-to\/the-complete-list-of-siri-commands\/."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"2280","DOI":"10.1016\/j.patcog.2014.01.005","article-title":"Automatic generation and detection of highly reliable fiducial markers under occlusion","volume":"47","year":"2014","journal-title":"Pattern Recognit."},{"key":"ref_23","unstructured":"(2020, January 12). ROS Toolbox for MATLAB. Available online: https:\/\/www.mathworks.com\/products\/ros.html."},{"key":"ref_24","unstructured":"(2020, April 15). Structural Similarity Index. Available online: https:\/\/scikit-image.org\/docs\/dev\/auto_examples\/transform\/plot_ssim.html\/."},{"key":"ref_25","unstructured":"(2020, April 15). How-To: Python Compare Two Images. Available online: https:\/\/www.pyimagesearch.com\/2014\/09\/15\/python-compare-two-images\/."},{"key":"ref_26","unstructured":"(2020, April 08). Blur Detection with OpenCV. Available online: https:\/\/www.pyimagesearch.com\/2015\/09\/07\/blur-detection-with-opencv\/."},{"key":"ref_27","first-page":"1755","article-title":"Dlib-ml: A Machine Learning Toolkit","volume":"10","author":"King","year":"2009","journal-title":"J. Mach. Learn. Res."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Szeliski, R. (2010). Computer Vision: Algorithms and Applications, Springer Science & Business Media.","DOI":"10.1007\/978-1-84882-935-0"},{"key":"ref_29","unstructured":"(2019, December 05). Camera calibration with OpenCV. Available online: https:\/\/docs.opencv.org\/2.4\/doc\/tutorials\/calib3d\/camera_calibration\/camera_calibration.html."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Hwang, S., Park, J., Kim, N., Choi, Y., and So Kweon, I. (2015, January 7\u201312). Multispectral pedestrian detection: Benchmark dataset and baseline. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298706"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Kuzdeuov, A., Rubagotti, M., and Varol, H.A. (2020). Neural Network Augmented Sensor Fusion for Pose Estimation of Tensegrity Manipulators. IEEE Sens. J.","DOI":"10.1109\/JSEN.2019.2959574"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.proeng.2014.12.091","article-title":"Visual Localization of Mobile Robot Using Artificial Markers","volume":"96","author":"Babinec","year":"2014","journal-title":"Procedia Eng."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"185","DOI":"10.1007\/s11370-017-0219-8","article-title":"Autonomous flying with quadrocopter using fuzzy control and ArUco markers","volume":"10","author":"Bacik","year":"2017","journal-title":"Intell. Serv. Robot."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Lupu, R.G., Herghelegiu, P., Botezatu, N., Moldoveanu, A., Ferche, O., Ilie, C., and Levinta, A. (2017, January 19\u201321). Virtual reality system for stroke recovery for upper limbs using ArUco markers. Proceedings of the International Conference on System Theory, Control and Computing (ICSTCC), Sinaia, Romania.","DOI":"10.1109\/ICSTCC.2017.8107092"},{"key":"ref_35","unstructured":"(2019, December 05). Camera Calibration and 3D Reconstruction. Available online: https:\/\/docs.opencv.org\/2.4\/modules\/calib3d\/doc\/camera_calibration_and_3d_reconstruction.html."},{"key":"ref_36","unstructured":"(2019, December 05). Geometric Image Transformations. Available online: https:\/\/docs.opencv.org\/2.4\/modules\/imgproc\/doc\/geometric_transformations.html."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Rai, P., and Khanna, P. (2012). Gender classification techniques: A review. Advances in Computer Science, Engineering & Applications, Springer.","DOI":"10.1007\/978-3-642-30157-5_6"},{"key":"ref_38","unstructured":"Assael, Y.M., Shillingford, B., Whiteson, S., and de Freitas, N. (2016). LipNet: Sentence-level Lipreading. arXiv."},{"key":"ref_39","unstructured":"Zeiler, M.D. (2012). Adadelta: an adaptive learning rate method. arXiv."},{"key":"ref_40","unstructured":"Bengio, Y., and LeCun, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. Conference Track Proceedings, Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7\u20139 May 2015, DBLP."},{"key":"ref_41","unstructured":"Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., and Weinberger, K.Q. (2014). Generative Adversarial Nets. Advances in Neural Information Processing Systems, Curran Associates, Inc."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Isola, P., Zhu, J.Y., Zhou, T., and Efros, A.A. (2017, January 21\u201326). Image-to-image translation with conditional adversarial networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.632"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Zhu, J.Y., Park, T., Isola, P., and Efros, A.A. (2017, January 22\u201329). Unpaired image-to-image translation using cycle-consistent adversarial networks. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.244"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Park, T., Efros, A.A., Zhang, R., and Zhu, J.Y. (2020). Contrastive learning for unpaired image-to-image translation. European Conference on Computer Vision, Proceedings of the 16th European Conference, Glasgow, UK, 23\u201328 August 2020, Springer.","DOI":"10.1007\/978-3-030-58545-7_19"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Zhang, T., Wiliem, A., Yang, S., and Lovell, B. (2018, January 20\u201323). TV-GAN: Generative adversarial network based thermal to visible face recognition. Proceedings of the International Conference on Biometrics (ICB), Gold Coast, Australia.","DOI":"10.1109\/ICB2018.2018.00035"},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"1161","DOI":"10.1109\/LSP.2018.2845692","article-title":"Thermal to visible facial image translation using generative adversarial networks","volume":"25","author":"Wang","year":"2018","journal-title":"IEEE Signal Process. Lett."},{"key":"ref_47","unstructured":"(2019, December 05). Face Recognition. Available online: https:\/\/github.com\/opencv\/opencv\/tree\/master\/samples\/dnn\/face_detector."},{"key":"ref_48","unstructured":"Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium. arXiv."},{"key":"ref_49","unstructured":"(2019, December 05). Face Recognition. Available online: https:\/\/github.com\/ageitgey\/face_recognition."},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Shon, S., and Glass, J. (2020, January 25\u201329). Multimodal Association for Speaker Verification. Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), Shanghai, China.","DOI":"10.21437\/Interspeech.2020-1996"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/10\/3465\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T06:02:14Z","timestamp":1760162534000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/10\/3465"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,5,16]]},"references-count":50,"journal-issue":{"issue":"10","published-online":{"date-parts":[[2021,5]]}},"alternative-id":["s21103465"],"URL":"https:\/\/doi.org\/10.3390\/s21103465","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,5,16]]}}}