{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,7,5]],"date-time":"2025-07-05T08:24:26Z","timestamp":1751703866526,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":32,"publisher":"ACM","license":[{"start":{"date-parts":[[2023,3,12]],"date-time":"2023-03-12T00:00:00Z","timestamp":1678579200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"JST CREST","award":["JPMJCR17A3"],"award-info":[{"award-number":["JPMJCR17A3"]}]},{"name":"JST Moonshot R&D","award":["JPMJMS2012"],"award-info":[{"award-number":["JPMJMS2012"]}]},{"name":"The University of Tokyo Human Augmentation Research Initiative"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,3,12]]},"DOI":"10.1145\/3582700.3582722","type":"proceedings-article","created":{"date-parts":[[2023,3,14]],"date-time":"2023-03-14T07:50:03Z","timestamp":1678780203000},"page":"200-208","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["AIx Speed: Playback Speed Optimization Using Listening Comprehension of Speech Recognition Models"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5181-320X","authenticated-orcid":false,"given":"Kazuki","family":"Kawamura","sequence":"first","affiliation":[{"name":"The University of Tokyo, Japan and Sony CSL Kyoto, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3629-2514","authenticated-orcid":false,"given":"Jun","family":"Rekimoto","sequence":"additional","affiliation":[{"name":"The University of Tokyo, Japan and Sony CSL Kyoto, 
Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2023,3,14]]},"reference":[{"key":"e_1_3_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2021.3117472"},{"key":"e_1_3_2_1_2_1","unstructured":"Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. In Advances in Neural Information Processing Systems."},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neunet.2021.03.004"},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/1518701.1518823"},{"key":"e_1_3_2_1_5_1","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","volume":"1","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 
Association for Computational Linguistics, Minneapolis, Minnesota, 4171\u20134186."},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/IJCIME49369.2019.00087"},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1044\/2017_JSLHR-S-16-0269"},{"volume-title":"Proceedings of the International Conference on Machine Learning.","author":"Graves A.","key":"e_1_3_2_1_8_1","unstructured":"A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber. 2006. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Nets. In ICML \u201906: Proceedings of the International Conference on Machine Learning."},{"key":"e_1_3_2_1_9_1","volume-title":"Proc. ACM Conference on Human Factors in Computing Systems.","author":"Higuchi Keita","year":"2017","unstructured":"Keita Higuchi, Ryo Yonetani, and Yoichi Sato. 2017. EgoScanning: Quickly Scanning First-Person Videos with Egocentric Elastic Timelines. In Proc. ACM Conference on Human Factors in Computing Systems."},{"key":"e_1_3_2_1_10_1","unstructured":"Wei-Ning Hsu, Benjamin Bolte, Yao-Hung\u00a0Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. 
(2021)."},{"volume-title":"IEEE 2002 Tenth IEEE International Workshop on Quality of Service (Cat. No.02EX564)","author":"Jiang Wenyu","key":"e_1_3_2_1_11_1","unstructured":"Wenyu Jiang and H. Schulzrinne. 2002. Speech recognition performance as an effective perceived quality predictor. In IEEE 2002 Tenth IEEE International Workshop on Quality of Service (Cat. No.02EX564). 269\u2013275."},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2021.3084299"},{"key":"e_1_3_2_1_13_1","volume-title":"Efficient Video Viewing System for Racquet Sports with Automatic Summarization Focusing on Rally Scenes. In ACM SIGGRAPH 2014 Posters.","author":"Kawamura Shunya","year":"2014","unstructured":"Shunya Kawamura, Tsukasa Fukusato, Tatsunori Hirai, and Shigeo Morishima. 2014. Efficient Video Viewing System for Racquet Sports with Automatic Summarization Focusing on Rally Scenes. In ACM SIGGRAPH 2014 Posters."},{"key":"e_1_3_2_1_14_1","volume-title":"Dynamic Object Scanning: Object-Based Elastic Timeline for Quickly Browsing First-Person Videos. In Extended Abstracts of the 2018 CHI Conference on Human Factors in Computing Systems.","author":"Kayukawa Seita","year":"2018","unstructured":"
Seita Kayukawa, Keita Higuchi, Ryo Yonetani, Masanori Nakamura, Yoichi Sato, and Shigeo Morishima. 2018. Dynamic Object Scanning: Object-Based Elastic Timeline for Quickly Browsing First-Person Videos. In Extended Abstracts of the 2018 CHI Conference on Human Factors in Computing Systems."},{"key":"e_1_3_2_1_15_1","volume-title":"Proc. of the Workshop on Advanced Visual Interfaces AVI.","author":"Kurihara Kazutaka","year":"2011","unstructured":"Kazutaka Kurihara. 2011. CinemaGazer: A System for Watching Video at Very High Speed. In Proc. of the Workshop on Advanced Visual Interfaces AVI."},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3375462.3375466"},{"key":"e_1_3_2_1_17_1","volume-title":"Decoupled Weight Decay Regularization. In International Conference on Learning Representations.","author":"Loshchilov Ilya","year":"2019","unstructured":"Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In International Conference on Learning Representations."},{"key":"e_1_3_2_1_18_1","unstructured":"Mishaim Malik, Muhammad Malik, Khawar Mehmood, and Imran Makhdoom. 2021. Automatic speech recognition: A survey. Multimedia Tools and Applications (2021)."},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.25080\/Majora-7b98e3ed-003"},{"key":"e_1_3_2_1_20_1","unstructured":"Nobuaki Minematsu, Yoshihiro Tomiyama, Kei Yoshimoto, Katsumasa Shimizu, Seiichi Nakagawa, Masatake Dantsuji, and Shozo Makino. 2002. English Speech Database Read by Japanese Learners for CALL System Development. In LREC. 
Citeseer."},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1002\/acp.3899"},{"key":"e_1_3_2_1_22_1","volume-title":"International Journal for Educational Media and Technology","author":"Nagahama Toru","year":"2017","unstructured":"Toru Nagahama and Yusuke Morita. 2017. Effect Analysis of Playback Speed for Lecture Video Including Instructor Images. International Journal for Educational Media and Technology (2017)."},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3491140.3528299"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2015.7178964"},{"volume-title":"PyTorch: An Imperative Style","author":"Paszke Adam","key":"e_1_3_2_1_25_1","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. 
PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32."},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/2702613.2732711"},{"key":"e_1_3_2_1_27_1","volume-title":"Automatic Speech Recognition (ASR) Systems Applied to Pronunciation Assessment of L2 Spanish for Japanese Speakers. Applied Sciences 11, 15","author":"Tejedor-Garc\u00eda Cristian","year":"2021","unstructured":"Cristian Tejedor-Garc\u00eda, Valent\u00edn Carde\u00f1oso-Payo, and David Escudero-Mancebo. 2021. Automatic Speech Recognition (ASR) Systems Applied to Pronunciation Assessment of L2 Spanish for Japanese Speakers. Applied Sciences 11, 15 (2021)."},{"key":"e_1_3_2_1_28_1","unstructured":"A\u00e4ron van\u00a0den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alexander Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. WaveNet: A Generative Model for Raw Audio. In Arxiv."},{"key":"e_1_3_2_1_29_1","volume-title":"Audio Summarization for Podcasts. In 2021 29th European Signal Processing Conference (EUSIPCO). IEEE, 431\u2013435","author":"Vartakavi Aneesh","year":"2021","unstructured":"Aneesh Vartakavi, Amanmeet Garg, and Zafar Rafii. 2021. Audio Summarization for Podcasts. In 2021 29th European Signal Processing Conference (EUSIPCO). 
IEEE, 431\u2013435."},{"key":"e_1_3_2_1_30_1","volume-title":"Tacotron: Towards End-to-End Speech Synthesis. In INTERSPEECH. 4006\u20134010.","author":"Wang Yuxuan","year":"2017","unstructured":"Yuxuan Wang, R.\u00a0J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron\u00a0J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc\u00a0V. Le, Yannis Agiomyrgiannakis, Rob Clark, and Rif\u00a0A. Saurous. 2017. Tacotron: Towards End-to-End Speech Synthesis. In INTERSPEECH. 4006\u20134010."},{"volume-title":"Achieving Human Parity in Conversational Speech Recognition","author":"Xiong Wayne","key":"e_1_3_2_1_31_1","unstructured":"Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Michael Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig. 2016. Achieving Human Parity in Conversational Speech Recognition. 
IEEE\/ACM Transactions on Audio, Speech, and Language Processing."},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/3313831.3376322"}],"event":{"name":"AHs '23: Augmented Humans Conference","acronym":"AHs '23","location":"Glasgow United Kingdom"},"container-title":["Augmented Humans Conference"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3582700.3582722","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3582700.3582722","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T18:07:55Z","timestamp":1750183675000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3582700.3582722"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,3,12]]},"references-count":32,"alternative-id":["10.1145\/3582700.3582722","10.1145\/3582700"],"URL":"https:\/\/doi.org\/10.1145\/3582700.3582722","relation":{},"subject":[],"published":{"date-parts":[[2023,3,12]]},"assertion":[{"value":"2023-03-14","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}