{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,7]],"date-time":"2026-01-07T07:57:55Z","timestamp":1767772675851,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":45,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,12,19]],"date-time":"2021-12-19T00:00:00Z","timestamp":1639872000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,12,19]]},"DOI":"10.1145\/3490035.3490284","type":"proceedings-article","created":{"date-parts":[[2021,12,14]],"date-time":"2021-12-14T23:15:16Z","timestamp":1639523716000},"page":"1-9","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":6,"title":["Intelligent video editing"],"prefix":"10.1145","author":[{"given":"Anchit","family":"Gupta","sequence":"first","affiliation":[{"name":"IIIT, Hyderabad, Telangana, India"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Faizan Farooq","family":"Khan","sequence":"additional","affiliation":[{"name":"IIIT, Hyderabad, Telangana, India"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Rudrabha","family":"Mukhopadhyay","sequence":"additional","affiliation":[{"name":"IIIT, Hyderabad, Telangana, India"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Vinay P.","family":"Namboodiri","sequence":"additional","affiliation":[{"name":"University of Bath, England"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"C. 
V.","family":"Jawahar","sequence":"additional","affiliation":[{"name":"IIIT, Hyderabad, Telangana, India"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2021,12,19]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"2021. Amazon Transcribe. https:\/\/aws.amazon.com\/transcribe\/?nc=sn&loc=1"},{"key":"e_1_3_2_1_2_1","unstructured":"2021. Speech-to-Text: Automatic Speech Recognition Google Cloud. https:\/\/cloud.google.com\/speech-to-text"},{"key":"e_1_3_2_1_3_1","unstructured":"Dario Amodei Rishita Anubhai Eric Battenberg Carl Case Jared Casper Bryan Catanzaro Jingdong Chen Mike Chrzanowski Adam Coates Greg Diamos Erich Elsen Jesse Engel Linxi Fan Christopher Fougner Tony Han Awni Hannun Billy Jun Patrick LeGresley Libby Lin Sharan Narang Andrew Ng Sherjil Ozair Ryan Prenger Jonathan Raiman Sanjeev Satheesh David Seetapun Shubho Sengupta Yi Wang Zhiqian Wang Chong Wang Bo Xiao Dani Yogatama Jun Zhan and Zhenyao Zhu. 2015. Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. arXiv:1512.02595 [cs.CL]  Dario Amodei Rishita Anubhai Eric Battenberg Carl Case Jared Casper Bryan Catanzaro Jingdong Chen Mike Chrzanowski Adam Coates Greg Diamos Erich Elsen Jesse Engel Linxi Fan Christopher Fougner Tony Han Awni Hannun Billy Jun Patrick LeGresley Libby Lin Sharan Narang Andrew Ng Sherjil Ozair Ryan Prenger Jonathan Raiman Sanjeev Satheesh David Seetapun Shubho Sengupta Yi Wang Zhiqian Wang Chong Wang Bo Xiao Dani Yogatama Jun Zhan and Zhenyao Zhu. 2015. Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. 
arXiv:1512.02595 [cs.CL]"},{"key":"e_1_3_2_1_4_1","volume-title":"Seong Joon Oh, and Hwalsuk Lee","author":"Baek Jeonghun","year":"2019","unstructured":"Jeonghun Baek, Geewook Kim, Junyeop Lee, Sungrae Park, Dongyoon Han, Sangdoo Yun, Seong Joon Oh, and Hwalsuk Lee. 2019. What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis. arXiv:1904.01906 [cs.CV]"},{"key":"e_1_3_2_1_5_1","unstructured":"Dzmitry Bahdanau Kyunghyun Cho and Yoshua Bengio. 2016. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473 [cs.CL]"},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"crossref","unstructured":"Guha Balakrishnan Amy Zhao Adrian V. Dalca Fredo Durand and John Guttag. 2018. Synthesizing Images of Humans in Unseen Poses. arXiv:1804.07739 [cs.CV]","DOI":"10.1109\/CVPR.2018.00870"},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"crossref","unstructured":"Aayush Bansal Shugao Ma Deva Ramanan and Yaser Sheikh. 2018. Recycle-GAN: Unsupervised Video Retargeting. arXiv:1808.05174 [cs.CV]","DOI":"10.1007\/978-3-030-01228-1_8"},{"key":"e_1_3_2_1_8_1","volume-title":"You said that? CoRR abs\/1705.02966","author":"Chung Joon Son","year":"2017","unstructured":"Joon Son Chung , Amir Jamaludin , and Andrew Zisserman . 2017. You said that? CoRR abs\/1705.02966 ( 2017 ). 
arXiv:1705.02966 http:\/\/arxiv.org\/abs\/1705.02966 Joon Son Chung, Amir Jamaludin, and Andrew Zisserman. 2017. You said that? CoRR abs\/1705.02966 (2017). arXiv:1705.02966 http:\/\/arxiv.org\/abs\/1705.02966"},{"key":"e_1_3_2_1_9_1","unstructured":"Ronan Collobert Christian Puhrsch and Gabriel Synnaeve. 2016. Wav2Letter: an End-to-End ConvNet-based Speech Recognition System. arXiv:1609.03193 [cs.LG]"},{"key":"e_1_3_2_1_10_1","unstructured":"deepfakes. 2021. FaceSwap. https:\/\/github.com\/deepfakes\/faceswap"},{"key":"e_1_3_2_1_11_1","unstructured":"dunnousername. 2021. Front-end tool for first-order-motion. https:\/\/github.com\/dunnousername\/yanderifier"},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.5555\/2969033.2969125"},{"key":"e_1_3_2_1_13_1","unstructured":"Keith Ito and Linda Johnson. 2017. The LJ Speech Dataset. https:\/\/keithito.com\/LJ-Speech-Dataset\/."},{"key":"e_1_3_2_1_14_1","unstructured":"JaidedAI. 2021. JaidedAI\/EasyOCR: Ready-to-use OCR with 80 supported languages and all popular writing scripts including Latin Chinese Arabic Devanagari Cyrillic and etc. https:\/\/github.com\/JaidedAI\/EasyOCR  JaidedAI. 2021. JaidedAI\/EasyOCR: Ready-to-use OCR with 80 supported languages and all popular writing scripts including Latin Chinese Arabic Devanagari Cyrillic and etc. 
https:\/\/github.com\/JaidedAI\/EasyOCR"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3351066"},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3355089.3356500"},{"key":"e_1_3_2_1_17_1","volume-title":"Lin (Eds.)","volume":"33","author":"Kim Jaehyeon","year":"2020","unstructured":"Jaehyeon Kim, Sungwon Kim, Jungil Kong, and Sungroh Yoon. 2020. Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 8067--8077. https:\/\/proceedings.neurips.cc\/paper\/2020\/file\/5c3b99e8f92532e5ad1556e53ceea00c-Paper.pdf"},{"key":"e_1_3_2_1_18_1","unstructured":"Gary H. Kranz. 1986. TRANSCRIBE. https:\/\/aws.amazon.com\/transcribe\/"},{"key":"e_1_3_2_1_19_1","unstructured":"Matthias A Lee. 2021. Python Tesseract. https:\/\/github.com\/madmaze\/pytesseract"},{"key":"e_1_3_2_1_20_1","unstructured":"LLC OpenShot Studios. 2021. https:\/\/www.openshot.org\/  LLC OpenShot Studios. 2021. 
https:\/\/www.openshot.org\/"},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.5555\/3454287.3455008"},{"key":"e_1_3_2_1_22_1","volume-title":"Luis RP, Jian Jiang, Sheng Zhang, Pingyu Wu, Bo Zhou, and Weiming Zhang.","author":"Perov Ivan","year":"2021","unstructured":"Ivan Perov, Daiheng Gao, Nikolay Chervoniy, Kunlin Liu, Sugasa Marangonda, Chris Um\u00e9, Mr. Dpfks, Carl Shift Facenheim, Luis RP, Jian Jiang, Sheng Zhang, Pingyu Wu, Bo Zhou, and Weiming Zhang. 2021. DeepFaceLab: Integrated, flexible and extensible face-swapping framework. arXiv:2005.05535 [cs.CV]"},{"key":"e_1_3_2_1_23_1","unstructured":"Jerin Philip Vinay P. Namboodiri and C.V. Jawahar. 2018. CVIT-MT Systems for WAT-2018. In Proceedings of the 32nd Pacific Asia Conference on Language Information and Computation: 5th Workshop on Asian Translation: 5th Workshop on Asian Translation. Association for Computational Linguistics Hong Kong. https:\/\/aclanthology.org\/Y18-3010"},{"key":"e_1_3_2_1_24_1","unstructured":"Jerin Philip Vinay P. Namboodiri and C. V. Jawahar. 2019. A Baseline Neural Machine Translation System for Indian Languages. arXiv:1907.12437 [cs.CL]  Jerin Philip Vinay P. Namboodiri and C. V. Jawahar. 2019. A Baseline Neural Machine Translation System for Indian Languages. 
arXiv:1907.12437 [cs.CL]"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-5215"},{"key":"e_1_3_2_1_26_1","volume-title":"International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=HJtEm4p6Z","author":"Ping Wei","year":"2018","unstructured":"Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller. 2018. Deep Voice 3: 2000-Speaker Neural Text-to-Speech. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=HJtEm4p6Z"},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413532"},{"key":"e_1_3_2_1_28_1","volume-title":"FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. ArXiv abs\/2006.04558","author":"Ren Yi","year":"2021","unstructured":"Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2021. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. ArXiv abs\/2006.04558 (2021). https:\/\/arxiv.org\/pdf\/2006.04558.pdf"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.5555\/3454287.3454572"},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8461368"},{"key":"e_1_3_2_1_31_1","volume-title":"Jacob Carlson, and Weining Li.","author":"Shen Zejiang","year":"2021","unstructured":"Zejiang Shen , Ruochen Zhang , Melissa Dell , Benjamin Charles Germain Lee , Jacob Carlson, and Weining Li. 2021 . 
LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis . arXiv preprint arXiv:2103.15348 (2021). Zejiang Shen, Ruochen Zhang, Melissa Dell, Benjamin Charles Germain Lee, Jacob Carlson, and Weining Li. 2021. LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis. arXiv preprint arXiv:2103.15348 (2021)."},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00248"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.5555\/3454287.3454928"},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.5555\/3367032.3367163"},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.5555\/2969033.2969173"},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3072959.3073640"},{"key":"e_1_3_2_1_37_1","unstructured":"Silero Team. 2021. Silero Models: pre-trained enterprise-grade STT \/ TTS models and benchmarks. https:\/\/github.com\/snakers4\/silero-models."},{"key":"e_1_3_2_1_38_1","unstructured":"Tesseract-Ocr. 2021. tesseract-ocr\/tesseract: Tesseract Open Source OCR Engine (main repository). https:\/\/github.com\/tesseract-ocr\/tesseract"},{"key":"e_1_3_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.5555\/3295222.3295349"},{"key":"e_1_3_2_1_40_1","unstructured":"vocali.se. 2021. https:\/\/vocali.se\/en"},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"crossref","unstructured":"Zbigniew Wojna Alex Gorban Dar-Shyang Lee Kevin Murphy Qian Yu Yeqing Li and Julian Ibarz. 2017. Attention-based Extraction of Structured Information from Street View Imagery. 
arXiv:1704.03549 [cs.CV]  Zbigniew Wojna Alex Gorban Dar-Shyang Lee Kevin Murphy Qian Yu Yeqing Li and Julian Ibarz. 2017. Attention-based Extraction of Structured Information from Street View Imagery. arXiv:1704.03549 [cs.CV]","DOI":"10.1109\/ICDAR.2017.143"},{"key":"e_1_3_2_1_42_1","volume-title":"Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. CoRR abs\/1609.08144","author":"Wu Yonghui","year":"2016","unstructured":"Yonghui Wu , Mike Schuster , Zhifeng Chen , Quoc V. Le , Mohammad Norouzi , Wolfgang Macherey , Maxim Krikun , Yuan Cao , Qin Gao , Klaus Macherey , Jeff Klingner , Apurva Shah , Melvin Johnson , Xiaobing Liu , \u0141ukasz Kaiser , Stephan Gouws , Yoshikiyo Kato , Taku Kudo , Hideto Kazawa , Keith Stevens , George Kurian , Nishant Patil , Wei Wang , Cliff Young , Jason Smith , Jason Riesa , Alex Rudnick , Oriol Vinyals , Greg Corrado , Macduff Hughes , and Jeffrey Dean . 2016. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. CoRR abs\/1609.08144 ( 2016 ). http:\/\/arxiv.org\/abs\/1609.08144 Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, \u0141ukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. CoRR abs\/1609.08144 (2016). 
http:\/\/arxiv.org\/abs\/1609.08144"},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3449063"},{"key":"e_1_3_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33019299"},{"key":"e_1_3_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/3414685.3417774"}],"event":{"name":"ICVGIP '21: Indian Conference on Computer Vision, Graphics and Image Processing","acronym":"ICVGIP '21","location":"Jodhpur India"},"container-title":["Proceedings of the Twelfth Indian Conference on Computer Vision, Graphics and Image Processing"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3490035.3490284","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3490035.3490284","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:31:22Z","timestamp":1750188682000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3490035.3490284"}},"subtitle":["incorporating modern talking face generation algorithms in a video editor"],"short-title":[],"issued":{"date-parts":[[2021,12,19]]},"references-count":45,"alternative-id":["10.1145\/3490035.3490284","10.1145\/3490035"],"URL":"https:\/\/doi.org\/10.1145\/3490035.3490284","relation":{},"subject":[],"published":{"date-parts":[[2021,12,19]]},"assertion":[{"value":"2021-12-19","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}