{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,23]],"date-time":"2025-08-23T05:19:54Z","timestamp":1755926394169,"version":"3.41.0"},"reference-count":79,"publisher":"Association for Computing Machinery (ACM)","issue":"6","license":[{"start":{"date-parts":[[2019,11,8]],"date-time":"2019-11-08T00:00:00Z","timestamp":1573171200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Zhijiang Lab's International Talent Fund for Young Professionals","award":["20191003"],"award-info":[{"award-number":["20191003"]}]},{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["1565978"],"award-info":[{"award-number":["1565978"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"name":"CCF-Tencent Open Fund","award":["RAGR20190109"],"award-info":[{"award-number":["RAGR20190109"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Graph."],"published-print":{"date-parts":[[2019,12,31]]},"abstract":"<jats:p>We introduce a novel approach for synthesizing realistic speeches for comics. Using a comic page as input, our approach synthesizes speeches for each comic character following the reading flow. It adopts a cascading strategy to synthesize speeches in two stages: Comic Visual Analysis and Comic Speech Synthesis. In the first stage, the input comic page is analyzed to identify the gender and age of the characters, as well as texts each character speaks and corresponding emotion. Guided by this analysis, in the second stage, our approach synthesizes realistic speeches for each character, which are consistent with the visual observations. Our experiments show that the proposed approach can synthesize realistic and lively speeches for different types of comics. Perceptual studies performed on the synthesis results of multiple sample comics validate the efficacy of our approach.<\/jats:p>","DOI":"10.1145\/3355089.3356487","type":"journal-article","created":{"date-parts":[[2019,11,8]],"date-time":"2019-11-08T20:27:58Z","timestamp":1573244878000},"page":"1-14","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":12,"title":["Comic-guided speech synthesis"],"prefix":"10.1145","volume":"38","author":[{"given":"Yujia","family":"Wang","sequence":"first","affiliation":[{"name":"Beijing Institute of Technology"}]},{"given":"Wenguan","family":"Wang","sequence":"additional","affiliation":[{"name":"Inception Institute of Artificial Intelligence &amp; Beijing Institute of Technology"}]},{"given":"Wei","family":"Liang","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology"}]},{"given":"Lap-Fai","family":"Yu","sequence":"additional","affiliation":[{"name":"George Mason University"}]}],"member":"320","published-online":{"date-parts":[[2019,11,8]]},"reference":[{"key":"e_1_2_2_1_1","unstructured":"Waleed Abdulla. 2017. Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow. 
https:\/\/github.com\/matterport\/Mask_RCNN."},{"key":"e_1_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.3390\/jimaging4070087"},{"key":"e_1_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1037\/0022-3514.70.3.614"},{"key":"e_1_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1111\/j.2044-8295.2011.02041.x"},{"key":"e_1_2_2_5_1","volume-title":"Thinking the voice: neural correlates of voice perception. Trends in cognitive sciences 8, 3","author":"Belin Pascal","year":"2004","unstructured":"Pascal Belin, Shirley Fecteau, and Catherine Bedard. 2004. Thinking the voice: neural correlates of voice perception. Trends in cognitive sciences 8, 3 (2004), 129--135."},{"key":"e_1_2_2_6_1","volume-title":"Praat: A System for Doing Phonetics by Computer. Glot International 5, 9\/10","author":"Boersma Paul","year":"2002","unstructured":"Paul Boersma. 2002. Praat: A System for Doing Phonetics by Computer. Glot International 5, 9\/10 (2002), 341--345."},{"key":"e_1_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1111\/j.2044-8295.1986.tb02199.x"},{"key":"e_1_2_2_8_1","volume-title":"Integrating face and voice in person perception. Trends in cognitive sciences 11, 12","author":"Campanella Salvatore","year":"2007","unstructured":"Salvatore Campanella and Pascal Belin. 2007. Integrating face and voice in person perception. Trends in cognitive sciences 11, 12 (2007), 535--543."},{"key":"e_1_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/2897824.2925873"},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/2366145.2366160"},{"key":"e_1_2_2_11_1","first-page":"1","article-title":"Look over here: attention-directing composition of manga elements","volume":"33","author":"Cao Ying","year":"2014","unstructured":"Ying Cao, Rynson W. H. Lau, and Antoni B. Chan. 2014. Look over here: attention-directing composition of manga elements. TOG 33, 4 (2014), 1--11.","journal-title":"TOG"},{"key":"e_1_2_2_12_1","doi-asserted-by":"crossref","unstructured":"Wei-Ta Chu and Wei-Wei Li. 2017. Manga facenet: Face detection in manga based on deep neural network. In ICMR. ACM 412--415.","DOI":"10.1145\/3078971.3079031"},{"key":"e_1_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2018.08.008"},{"key":"e_1_2_2_14_1","volume-title":"Perception of levels of emotion in speech prosody","author":"Dimos K","year":"2015","unstructured":"K Dimos, L Dick, and V Dellwo. 2015. Perception of levels of emotion in speech prosody. The Scottish Consortium for ICPhS (2015)."},{"volume-title":"Empirical Comics Research: Digital, Multimodal, and Cognitive Methods","author":"Dunst Alexander","key":"e_1_2_2_15_1","unstructured":"Alexander Dunst, Jochen Laubrock, and Janina Wildfeuer. 2018. Empirical Comics Research: Digital, Multimodal, and Cognitive Methods. Routledge."},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3197517.3201326"},{"key":"e_1_2_2_17_1","unstructured":"Jesse Engel Cinjon Resnick Adam Roberts Sander Dieleman Mohammad Norouzi Douglas Eck and Karen Simonyan. 2017. Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders. In ICML. 
1068--1077."},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neunet.2017.02.013"},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D17-1169"},{"key":"e_1_2_2_20_1","first-page":"96","article-title":"VoCo: text-based insertion and replacement in audio narration","volume":"36","author":"Finkelstein Adam","year":"2017","unstructured":"Adam Finkelstein, Adam Finkelstein, Adam Finkelstein, Adam Finkelstein, and Adam Finkelstein. 2017. VoCo: text-based insertion and replacement in audio narration. TOG 36, 4 (2017), 96.","journal-title":"TOG"},{"key":"e_1_2_2_21_1","volume-title":"Clustering by passing messages between data points. science 315, 5814","author":"Frey Brendan J","year":"2007","unstructured":"Brendan J Frey and Delbert Dueck. 2007. Clustering by passing messages between data points. science 315, 5814 (2007), 972--976."},{"key":"e_1_2_2_22_1","doi-asserted-by":"crossref","unstructured":"Aviv Gabbay Asaph Shamir and Shmuel Peleg. 2018. Visual Speech Enhancement. In Interspeech. 1170--1174.","DOI":"10.21437\/Interspeech.2018-1955"},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cub.2008.03.030"},{"key":"e_1_2_2_24_1","doi-asserted-by":"crossref","unstructured":"Ankush Gupta Andrea Vedaldi and Andrew Zisserman. 2016. Synthetic data for text localisation in natural images. In CVPR. 2315--2324.","DOI":"10.1109\/CVPR.2016.254"},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2006.100"},{"key":"e_1_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1093\/biomet\/57.1.97"},{"key":"e_1_2_2_27_1","first-page":"18","article-title":"Sonic mediatization of the book: affordances of the audiobook. MedieKultur","volume":"29","author":"Have Iben","year":"2013","unstructured":"Iben Have and Birgitte Stougaard Pedersen. 2013. Sonic mediatization of the book: affordances of the audiobook. MedieKultur: Journal of media and communication research 29, 54 (2013), 18-p.","journal-title":"Journal of media and communication research"},{"key":"e_1_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3130800.31310887"},{"key":"e_1_2_2_29_1","first-page":"373","article-title":"Unit selection in a concatenative speech synthesis system using a large speech database","volume":"1","author":"Hunt Andrew J","year":"1996","unstructured":"Andrew J Hunt and Alan W Black. 1996. Unit selection in a concatenative speech synthesis system using a large speech database. In ICASSP, Vol. 1. 373--376.","journal-title":"ICASSP"},{"key":"e_1_2_2_30_1","first-page":"2410","article-title":"Efficient Neural Audio Synthesis","volume":"80","author":"Kalchbrenner Nal","year":"2018","unstructured":"Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. 2018. Efficient Neural Audio Synthesis. In ICML, Vol. 80. 2410--2419.","journal-title":"ICML"},{"key":"e_1_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cub.2003.09.005"},{"key":"e_1_2_2_32_1","doi-asserted-by":"crossref","unstructured":"Hideki Kawahara Masanori Morise Toru Takahashi Ryuichi Nisimura Toshio Irino and Hideki Banno. 2008. TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum F0 and aperiodicity estimation. In ICASSP. 
3933--3936.","DOI":"10.1109\/ICASSP.2008.4518514"},{"key":"e_1_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.dss.2018.09.002"},{"key":"e_1_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICEBE.2015.55"},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33011707"},{"key":"e_1_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1121\/1.380917"},{"key":"e_1_2_2_37_1","volume-title":"Emotional End-to-End Neural Speech Synthesizer. In NIPS Workshop.","author":"Lee Younggun","year":"2017","unstructured":"Younggun Lee, Azam Rabiee, and Soo-Young Lee. 2017. Emotional End-to-End Neural Speech Synthesizer. In NIPS Workshop."},{"key":"e_1_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3072959.3073675"},{"key":"e_1_2_2_39_1","volume-title":"EMPHASIS: An Emotional Phoneme-based Acoustic Model for Speech Synthesis System. arXiv preprint arXiv:1806.09276","author":"Li Hao","year":"2018","unstructured":"Hao Li, Yongguo Kang, and Zhenyu Wang. 2018. EMPHASIS: An Emotional Phoneme-based Acoustic Model for Speech Synthesis System. arXiv preprint arXiv:1806.09276 (2018)."},{"key":"e_1_2_2_40_1","doi-asserted-by":"crossref","unstructured":"Ziwei Liu Ping Luo Xiaogang Wang and Xiaoou Tang. 2015. Deep learning face attributes in the wild. In ICCV. 3730--3738.","DOI":"10.1109\/ICCV.2015.425"},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2018.2818020"},{"key":"e_1_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11042-016-4020-z"},{"key":"e_1_2_2_43_1","volume-title":"How do you say \"Hello\"? Personality impressions from brief novel voices. PLOS ONE 9, 3","author":"McAleer Phil","year":"2014","unstructured":"Phil McAleer, Alexander Todorov, and Pascal Belin. 2014. How do you say \"Hello\"? Personality impressions from brief novel voices. PLOS ONE 9, 3 (2014), e90779."},{"key":"e_1_2_2_44_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/1609967.1609969","article-title":"Talking bodies:Sensitivity to desynchronization of conversations","volume":"6","author":"Mcdonnell Rachel","year":"2009","unstructured":"Rachel Mcdonnell, Cathy Ennis, Simon Dobbyn, and Carol O'Sullivan. 2009. Talking bodies:Sensitivity to desynchronization of conversations. TAP 6, 4 (2009), 1--8.","journal-title":"TAP"},{"key":"e_1_2_2_45_1","unstructured":"Soroush Mehri Kundan Kumar Ishaan Gulrajani Rithesh Kumar Shubham Jain Jose Sotelo Aaron Courville and Yoshua Bengio. 2017. SampleRNN: An unconditional end-to-end neural audio generation model. In ICLR."},{"key":"e_1_2_2_46_1","volume-title":"Beyond the schema given: Affective comprehension of literary narratives. 3, 1","author":"Miall David S.","year":"1989","unstructured":"David S. Miall. 1989. Beyond the schema given: Affective comprehension of literary narratives. 3, 1 (1989), 55--78."},{"key":"e_1_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.specom.2017.01.008"},{"key":"e_1_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.1587\/transinf.2015EDP7457"},{"key":"e_1_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.1016\/S1350-4533(02)00060-7"},{"key":"e_1_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDAR.2017.290"},{"key":"e_1_2_2_51_1","volume-title":"Object Detection for Comics using Manga109 Annotations. CoRR abs\/1803.08670","author":"Ogawa Toru","year":"2018","unstructured":"Toru Ogawa, Atsushi Otsubo, Rei Narita, Yusuke Matsui, Toshihiko Yamasaki, and Kiyoharu Aizawa. 2018. Object Detection for Comics using Manga109 Annotations. 
CoRR abs\/1803.08670 (2018)."},{"key":"e_1_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/2948066"},{"key":"e_1_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.1121\/1.4991356"},{"key":"e_1_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/2647868.2654990"},{"key":"e_1_2_2_55_1","doi-asserted-by":"crossref","unstructured":"Robert S Petersen. 2011. Comics, manga and graphic novels: a history of graphic narratives. ABC-CLIO.","DOI":"10.5040\/9798400628856"},{"key":"e_1_2_2_56_1","unstructured":"Siyuan Qi Wenguan Wang Baoxiong Jia Jianbing Shen and Song-Chun Zhu. 2018. Learning human-object interactions by graph parsing neural networks. In ECCV. 401--417."},{"key":"e_1_2_2_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDAR.2017.178"},{"key":"e_1_2_2_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/1409060.1409108"},{"key":"e_1_2_2_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/1141911.1142017"},{"volume-title":"The Author's Guide to Audiobook Creation","author":"Rieman Richard","key":"e_1_2_2_60_1","unstructured":"Richard Rieman. 2016. The Author's Guide to Audiobook Creation. Breckenridge Press."},{"key":"e_1_2_2_61_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10032-015-0243-1"},{"key":"e_1_2_2_62_1","volume-title":"Moon: A mixed objective optimization network for the recognition of facial attributes. In ECCV. 19--35.","author":"Rudd Ethan M","year":"2016","unstructured":"Ethan M Rudd, Manuel G\u00fcnther, and Terrance E Boult. 2016. Moon: A mixed objective optimization network for the recognition of facial attributes. In ECCV. 19--35."},{"key":"e_1_2_2_63_1","doi-asserted-by":"crossref","unstructured":"Jonathan Shen Ruoming Pang Ron J Weiss Mike Schuster Navdeep Jaitly Zongheng Yang Zhifeng Chen Yu Zhang Yuxuan Wang RJ Skerry-Ryan et al. 2018. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In ICASSP. IEEE 4779--4783.","DOI":"10.1109\/ICASSP.2018.8461368"},{"key":"e_1_2_2_64_1","volume-title":"Aster: An attentional scene text recognizer with flexible rectification. TPAMI","author":"Shi Baoguang","year":"2018","unstructured":"Baoguang Shi, Mingkun Yang, Xinggang Wang, Pengyuan Lyu, Cong Yao, and Xiang Bai. 2018. Aster: An attentional scene text recognizer with flexible rectification. TPAMI (2018)."},{"key":"e_1_2_2_65_1","doi-asserted-by":"publisher","DOI":"10.1145\/2897824.2925972"},{"key":"e_1_2_2_66_1","unstructured":"RJ Skerry-Ryan Eric Battenberg Ying Xiao Yuxuan Wang Daisy Stanton Joel Shor Ron J Weiss Rob Clark and Rif A Saurous. 2018. Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron. In ICML. 4700--4709."},{"key":"e_1_2_2_67_1","volume-title":"International Conference on Learning Representations Workshop.","author":"Sotelo Jose","year":"2017","unstructured":"Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville, and Yoshua Bengio. 2017. Char2wav: End-to-end speech synthesis. In International Conference on Learning Representations Workshop."},{"key":"e_1_2_2_68_1","volume-title":"Facial Landmark Detection for Manga Images. arXiv preprint arXiv:1811.03214","author":"Stricker Marco","year":"2018","unstructured":"Marco Stricker, Olivier Augereau, Koichi Kise, and Motoi Iwata. 2018. Facial Landmark Detection for Manga Images. arXiv preprint arXiv:1811.03214 (2018)."},{"key":"e_1_2_2_69_1","doi-asserted-by":"publisher","DOI":"10.1145\/3072959.3073640"},{"key":"e_1_2_2_70_1","volume-title":"Emotions in Everyday Life. 
PLOS ONE 10, 12","author":"Trampe Debra","year":"2015","unstructured":"Debra Trampe, Jordi Quoidbach, and Maxime Taquet. 2015. Emotions in Everyday Life. PLOS ONE 10, 12 (2015)."},{"key":"e_1_2_2_71_1","volume-title":"WaveNet: A Generative Model for Raw Audio. In 9th ISCA Speech Synthesis Workshop. 125--125","author":"van den Oord Aaron","year":"2016","unstructured":"Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. WaveNet: A Generative Model for Raw Audio. In 9th ISCA Speech Synthesis Workshop. 125--125."},{"volume-title":"The language of comics: Word and image. Univ","author":"Varnum Robin","key":"e_1_2_2_72_1","unstructured":"Robin Varnum and Christina T Gibbons. 2007. The language of comics: Word and image. Univ. Press of Mississippi."},{"key":"e_1_2_2_73_1","unstructured":"Christophe Veaux Junichi Yamagishi Kirsten MacDonald et al. 2016. CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit. (2016)."},{"key":"e_1_2_2_74_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2018.2840724"},{"key":"e_1_2_2_75_1","doi-asserted-by":"crossref","unstructured":"Wenguan Wang Yuanlu Xu Jianbing Shen and Song-Chun Zhu. 2018c. Attentive fashion grammar network for fashion landmark detection and clothing category classification. In CVPR. 4271--4280.","DOI":"10.1109\/CVPR.2018.00449"},{"key":"e_1_2_2_76_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2019.05.026"},{"key":"e_1_2_2_77_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2017-1452"},{"key":"e_1_2_2_78_1","volume-title":"Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis. In ICML.","author":"Wang Yuxuan","year":"2018","unstructured":"Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei Ren, Ye Jia, and Rif A Saurous. 2018b. Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis. In ICML."},{"volume-title":"Learning utterance-level representations for speech emotion and age\/gender recognition using deep neural networks","author":"Wang Zhong-Qiu","key":"e_1_2_2_79_1","unstructured":"Zhong-Qiu Wang and Ivan Tashev. 2017. Learning utterance-level representations for speech emotion and age\/gender recognition using deep neural networks. In ICASSP. 
IEEE, 5150--5154."}],"container-title":["ACM Transactions on Graphics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3355089.3356487","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3355089.3356487","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3355089.3356487","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T23:44:40Z","timestamp":1750203880000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3355089.3356487"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,11,8]]},"references-count":79,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2019,12,31]]}},"alternative-id":["10.1145\/3355089.3356487"],"URL":"https:\/\/doi.org\/10.1145\/3355089.3356487","relation":{},"ISSN":["0730-0301","1557-7368"],"issn-type":[{"type":"print","value":"0730-0301"},{"type":"electronic","value":"1557-7368"}],"subject":[],"published":{"date-parts":[[2019,11,8]]},"assertion":[{"value":"2019-11-08","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}