{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,3]],"date-time":"2026-02-03T19:42:34Z","timestamp":1770147754323,"version":"3.49.0"},"publisher-location":"New York, NY, USA","reference-count":45,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T00:00:00Z","timestamp":1665360000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,10,10]]},"DOI":"10.1145\/3503161.3551576","type":"proceedings-article","created":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T15:43:12Z","timestamp":1665416592000},"page":"7045-7049","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":20,"title":["Deeply Exploit Visual and Language Information for Social Media Popularity Prediction"],"prefix":"10.1145","author":[{"given":"Jianmin","family":"Wu","sequence":"first","affiliation":[{"name":"Alibaba Group, Hangzhou, China"}]},{"given":"Liming","family":"Zhao","sequence":"additional","affiliation":[{"name":"Alibaba Group, Hangzhou, China"}]},{"given":"Dangwei","family":"Li","sequence":"additional","affiliation":[{"name":"Alibaba Group, Hangzhou, China"}]},{"given":"Chen-Wei","family":"Xie","sequence":"additional","affiliation":[{"name":"Alibaba Group, Hangzhou, China"}]},{"given":"Siyang","family":"Sun","sequence":"additional","affiliation":[{"name":"Alibaba Group, Hangzhou, China"}]},{"given":"Yun","family":"Zheng","sequence":"additional","affiliation":[{"name":"Alibaba Group, Hangzhou, China"}]}],"member":"320","published-online":{"date-parts":[[2022,10,10]]},"reference":[{"key":"e_1_3_2_2_1_1","volume-title":"A CLIP-Hitchhiker's Guide to Long Video Retrieval. 
arXiv preprint arXiv:2205.08508","author":"Bain Max","year":"2022","unstructured":"Max Bain, Arsha Nagrani, G\u00fcl Varol, and Andrew Zisserman. 2022. A CLIP-Hitchhiker's Guide to Long Video Retrieval. arXiv preprint arXiv:2205.08508 (2022)."},{"key":"e_1_3_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/3336191.3371834"},{"key":"e_1_3_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2018.12.039"},{"key":"e_1_3_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3356072"},{"key":"e_1_3_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/2566486.2567997"},{"key":"e_1_3_2_2_6_1","volume-title":"Fine-grained image captioning with clip reward. arXiv preprint arXiv:2205.13115","author":"Cho Jaemin","year":"2022","unstructured":"Jaemin Cho, Seunghyun Yoon, Ajinkya Kale, Franck Dernoncourt, Trung Bui, and Mohit Bansal. 2022. Fine-grained image captioning with clip reward. arXiv preprint arXiv:2205.13115 (2022)."},{"key":"e_1_3_2_2_7_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. 
arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3356062"},{"key":"e_1_3_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/2631775.2631808"},{"key":"e_1_3_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/2733373.2806361"},{"key":"e_1_3_2_2_11_1","volume-title":"Open-vocabulary Object Detection via Vision and Language Knowledge Distillation. arXiv preprint arXiv:2104.13921","author":"Gu Xiuye","year":"2021","unstructured":"Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. 2021. Open-vocabulary Object Detection via Vision and Language Knowledge Distillation. arXiv preprint arXiv:2104.13921 (2021)."},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/2647868.2656405"},{"key":"e_1_3_2_2_13_1","first-page":"677","article-title":"Benchmarking Image Retrieval Diversification Techniques for Social Media","volume":"23","author":"Ionescu Bogdan","year":"2021","unstructured":"Bogdan Ionescu, Maia Rohm, Bogdan Boteanu, Alexandru-Lucian G\u00eensca, Mihai Lupu, and Henning M\u00fcller. 2021. Benchmarking Image Retrieval Diversification Techniques for Social Media. IEEE TMM, Vol. 23 (2021), 677--691. https:\/\/doi.org\/10.1109\/TMM.2020.2986579","journal-title":"IEEE TMM"},{"key":"e_1_3_2_2_14_1","volume-title":"International Conference on Machine Learning. 
PMLR, 4904--4916","author":"Jia Chao","year":"2021","unstructured":"Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning. PMLR, 4904--4916."},{"key":"e_1_3_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3356060"},{"key":"e_1_3_2_2_16_1","volume-title":"Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems","author":"Ke Guolin","year":"2017","unstructured":"Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems, Vol. 30 (2017)."},{"key":"e_1_3_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/2566486.2567996"},{"key":"e_1_3_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3416273"},{"key":"e_1_3_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/2783258.2788582"},{"key":"e_1_3_2_2_20_1","volume-title":"Adapting CLIP For Phrase Localization Without Further Training. arXiv preprint arXiv:2204.03647","author":"Li Jiahao","year":"2022","unstructured":"Jiahao Li, Greg Shakhnarovich, and Raymond A Yeh. 2022. Adapting CLIP For Phrase Localization Without Further Training. 
arXiv preprint arXiv:2204.03647 (2022)."},{"key":"e_1_3_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1177\/0022243719881113"},{"key":"e_1_3_2_2_22_1","volume-title":"Dressing for Attention: Outfit Based Fashion Popularity Prediction. In IEEE International Conference on Image Processing (ICIP).","author":"Lo Ling","year":"2019","unstructured":"Ling Lo, Chia-Lin Liu, Rong-An Lin, Bo Wu, and Wen-Huang Cheng. 2019. Dressing for Attention: Outfit Based Fashion Popularity Prediction. In IEEE International Conference on Image Processing (ICIP)."},{"key":"e_1_3_2_2_23_1","volume-title":"Clip4clip: An empirical study of clip for end to end video clip retrieval. arXiv preprint arXiv:2104.08860","author":"Luo Huaishao","year":"2021","unstructured":"Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. 2021. Clip4clip: An empirical study of clip for end to end video clip retrieval. arXiv preprint arXiv:2104.08860 (2021)."},{"key":"e_1_3_2_2_24_1","volume-title":"Contextual video recommendation by multimodal relevance and user feedback. ACM Transactions on Information Systems","author":"Mei Tao","year":"2011","unstructured":"Tao Mei, Bo Yang, Xian-Sheng Hua, and Shipeng Li. 2011. 
Contextual video recommendation by multimodal relevance and user feedback. ACM Transactions on Information Systems (2011)."},{"key":"e_1_3_2_2_25_1","volume-title":"Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734","author":"Mokady Ron","year":"2021","unstructured":"Ron Mokady, Amir Hertz, and Amit H Bermano. 2021. Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734 (2021)."},{"key":"e_1_3_2_2_26_1","volume-title":"Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741","author":"Nichol Alex","year":"2021","unstructured":"Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. 2021. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021)."},{"key":"e_1_3_2_2_27_1","volume-title":"Anna Veronika Dorogush, and Andrey Gulin","author":"Prokhorenkova Liudmila","year":"2018","unstructured":"Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. 2018. CatBoost: unbiased boosting with categorical features. Advances in neural information processing systems, Vol. 
31 (2018)."},{"key":"e_1_3_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2013.168"},{"key":"e_1_3_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3219819.3220077"},{"key":"e_1_3_2_2_30_1","volume-title":"International Conference on Machine Learning. PMLR, 8748--8763","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748--8763."},{"key":"e_1_3_2_2_31_1","volume-title":"Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125","author":"Ramesh Aditya","year":"2022","unstructured":"Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 (2022)."},{"key":"e_1_3_2_2_32_1","volume-title":"Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al.","author":"Saharia Chitwan","year":"2022","unstructured":"
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. arXiv preprint arXiv:2205.11487 (2022)."},{"key":"e_1_3_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00372"},{"key":"e_1_3_2_2_34_1","volume-title":"ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension. arXiv preprint arXiv:2204.05991","author":"Subramanian Sanjay","year":"2022","unstructured":"Sanjay Subramanian, Will Merrill, Trevor Darrell, Matt Gardner, Sameer Singh, and Anna Rohrbach. 2022. ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension. arXiv preprint arXiv:2204.05991 (2022)."},{"key":"e_1_3_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/1787234.1787254"},{"key":"e_1_3_2_2_36_1","volume-title":"NIMA: Neural image assessment","author":"Talebi Hossein","year":"2018","unstructured":"Hossein Talebi and Peyman Milanfar. 2018. NIMA: Neural image assessment. IEEE transactions on image processing, Vol. 27, 8 (2018), 3998--4011."},{"key":"e_1_3_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1186\/s13174-014-0008-y"},{"key":"e_1_3_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3416294"},{"key":"e_1_3_2_2_39_1","doi-asserted-by":"crossref","unstructured":"Thomas Wolf Lysandre Debut Victor Sanh Julien Chaumond Clement Delangue Anthony Moi Pierric Cistac Tim Rault R\u00e9mi Louf Morgan Funtowicz et al. 2019. 
Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019).","DOI":"10.18653\/v1\/2020.emnlp-demos.6"},{"key":"e_1_3_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3356084"},{"key":"e_1_3_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/2964284.2964335"},{"key":"e_1_3_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/2600428.2609569"},{"key":"e_1_3_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3416274"},{"key":"e_1_3_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/2783258.2783401"},{"key":"e_1_3_2_2_45_1","volume-title":"Places: A 10 million image database for scene recognition","author":"Zhou Bolei","year":"2017","unstructured":"Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2017. Places: A 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence, Vol. 
40, 6 (2017), 1452--1464."}],"event":{"name":"MM '22: The 30th ACM International Conference on Multimedia","location":"Lisboa Portugal","acronym":"MM '22","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 30th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3551576","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503161.3551576","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:49:18Z","timestamp":1750182558000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3551576"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,10]]},"references-count":45,"alternative-id":["10.1145\/3503161.3551576","10.1145\/3503161"],"URL":"https:\/\/doi.org\/10.1145\/3503161.3551576","relation":{},"subject":[],"published":{"date-parts":[[2022,10,10]]},"assertion":[{"value":"2022-10-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}