{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,12]],"date-time":"2026-03-12T13:05:10Z","timestamp":1773320710533,"version":"3.50.1"},"reference-count":57,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2026,3,11]],"date-time":"2026-03-11T00:00:00Z","timestamp":1773187200000},"content-version":"vor","delay-in-days":6,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2026,3,5]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Since the SciCap dataset\u2019s launch in 2021, the research community has made significant progress in generating captions for scientific figures in scholarly articles. In 2023, the first SciCap Challenge took place, inviting global teams to use an expanded SciCap dataset to develop models for captioning diverse figure types across various academic fields. At the same time, text generation models advanced quickly, with many powerful pre-trained large multimodal models (LMMs) emerging that showed impressive capabilities in various vision-and-language tasks. This paper presents an overview of the first SciCap Challenge and details the performance of various models on its data, capturing a snapshot of the field\u2019s state. We found that professional editors overwhelmingly preferred figure captions generated by GPT-4V over those from all other models and even the original captions written by authors. 
Following this key finding, we conducted detailed analyses to answer this question: Have advanced LMMs solved the task of generating captions for scientific figures?<\/jats:p>","DOI":"10.1162\/tacl.a.653","type":"journal-article","created":{"date-parts":[[2026,3,11]],"date-time":"2026-03-11T19:37:10Z","timestamp":1773257830000},"page":"233-252","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":0,"title":["Do Large Multimodal Models Solve Caption Generation for Scientific Figures? Lessons Learned from <scp>SciCap<\/scp> Challenge 2023"],"prefix":"10.1162","volume":"14","author":[{"given":"Ting-Yao \u2018Edward\u2019","family":"Hsu","sequence":"first","affiliation":[{"name":"Pennsylvania State University, USA. txh357@psu.edu"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yi-Li","family":"Hsu","sequence":"additional","affiliation":[{"name":"National Tsing Hua University, Taiwan. yili.hsu@iis.sinica.edu.tw"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shaurya","family":"Rohatgi","sequence":"additional","affiliation":[{"name":"AllSci, USA. srohatgi@allsci.com"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chieh-Yang","family":"Huang","sequence":"additional","affiliation":[{"name":"MetaMetrics Inc., USA. cyhuang@lexile.com"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ho Yin Sam","family":"Ng","sequence":"additional","affiliation":[{"name":"Pennsylvania State University, USA. sam.ng@psu.edu"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ryan","family":"Rossi","sequence":"additional","affiliation":[{"name":"Adobe Research, USA. ryrossi@adobe.com"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sungchul","family":"Kim","sequence":"additional","affiliation":[{"name":"Adobe Research, USA. 
sukim@adobe.com"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tong","family":"Yu","sequence":"additional","affiliation":[{"name":"Adobe Research, USA. tyu@adobe.com"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Lun-Wei","family":"Ku","sequence":"additional","affiliation":[{"name":"Institute of Information Science, Academia Sinica, Taiwan. lwku@iis.sinica.edu.tw"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Clyde Lee","family":"Giles","sequence":"additional","affiliation":[{"name":"Pennsylvania State University, USA. clg20@psu.edu"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ting-Hao \u2018Kenneth\u2019","family":"Huang","sequence":"additional","affiliation":[{"name":"Pennsylvania State University, USA. txh710@psu.edu"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"281","published-online":{"date-parts":[[2026,3,5]]},"reference":[{"key":"2026031115370956600_bib1","doi-asserted-by":"publisher","first-page":"67","DOI":"10.18653\/v1\/2024.eacl-long.5","article-title":"Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source LLMs","volume-title":"Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Balloccu","year":"2024"},{"key":"2026031115370956600_bib2","volume-title":"Human Cognition: Learning, Understanding, and Remembering","author":"Bransford","year":"1979"},{"key":"2026031115370956600_bib3","article-title":"Iterative aggregation method for solving principal component analysis problems","author":"Bulgakov","year":"2016","journal-title":"arXiv preprint arXiv:1602.08800"},{"key":"2026031115370956600_bib4","article-title":"The solution for the ICCV 2023 1st scientific figure captioning challenge","author":"Chao","year":"2023","journal-title":"arXiv 
preprint"},{"key":"2026031115370956600_bib5","doi-asserted-by":"publisher","first-page":"1537","DOI":"10.1109\/WACV45572.2020.9093592","article-title":"Figure captioning with relation maps for reasoning","volume-title":"Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision","author":"Chen","year":"2020"},{"key":"2026031115370956600_bib6","doi-asserted-by":"publisher","first-page":"143","DOI":"10.1145\/2910896.2910904","article-title":"Pdffigures 2.0: Mining figures from research papers","volume-title":"Proceedings of the 16th ACM\/IEEE-CS on Joint Conference on Digital Libraries","author":"Clark","year":"2016"},{"key":"2026031115370956600_bib7","doi-asserted-by":"publisher","first-page":"4884","DOI":"10.18653\/v1\/P19-1483","article-title":"Handling divergent reference texts when evaluating table-to-text generation","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Dhingra","year":"2019"},{"key":"2026031115370956600_bib8","article-title":"PP-OCR: A practical ultra lightweight OCR system","author":"Yuning","year":"2020","journal-title":"arXiv preprint arXiv:2009.09941"},{"key":"2026031115370956600_bib9","doi-asserted-by":"publisher","first-page":"82","DOI":"10.1109\/P3HPC49587.2019.00013","article-title":"Clangjit: Enhancing c++ with just-in-time compilation","volume-title":"2019 IEEE\/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC)","author":"Finkel","year":"2019"},{"issue":"2","key":"2026031115370956600_bib10","doi-asserted-by":"publisher","first-page":"121","DOI":"10.1198\/000313002317572790","article-title":"Let\u2019s practice what we preach: Turning tables into graphs","volume":"56","author":"Gelman","year":"2002","journal-title":"The American Statistician"},{"key":"2026031115370956600_bib11","doi-asserted-by":"publisher","first-page":"210","DOI":"10.3115\/v1\/E14-4041","article-title":"Finding middle ground? 
Multi-objective natural language generation from time-series data","volume-title":"Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers","author":"Gkatzia","year":"2014"},{"key":"2026031115370956600_bib12","article-title":"Gemma: Open models based on gemini research and technology","author":"Team","year":"2024","journal-title":"arXiv preprint arXiv:2403.08295"},{"key":"2026031115370956600_bib13","unstructured":"Google Research. 2022. Python rouge implementation. https:\/\/github.com\/google-research\/google-research\/tree\/master\/rouge"},{"issue":"2","key":"2026031115370956600_bib14","doi-asserted-by":"publisher","first-page":"108","DOI":"10.3138\/jsp.34.2.108","article-title":"Single authors are not alone: Colleagues often help","volume":"34","author":"Hartley","year":"2003","journal-title":"Journal of Scholarly Publishing"},{"issue":"6","key":"2026031115370956600_bib15","doi-asserted-by":"publisher","first-page":"717","DOI":"10.1006\/jmla.1993.1036","article-title":"Constructing mental models of machines from text and diagrams","volume":"32","author":"Hegarty","year":"1993","journal-title":"Journal of Memory and Language"},{"key":"2026031115370956600_bib16","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.nlp4science-1.7","article-title":"Scitune: Aligning large language models with scientific multimodal instructions","author":"Horawalavithana","year":"2023","journal-title":"arXiv preprint arXiv:2307.01139"},{"key":"2026031115370956600_bib17","doi-asserted-by":"crossref","first-page":"3258","DOI":"10.18653\/v1\/2021.findings-emnlp.277","article-title":"SciCap: Generating captions for scientific figures","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2021","author":"Hsu","year":"2021"},{"key":"2026031115370956600_bib18","doi-asserted-by":"crossref","DOI":"10.1145\/3613905.3650738","article-title":"SciCapenter: Supporting caption composition for 
scientific figures with machine-generated captions and ratings","volume-title":"Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems","author":"Hsu","year":"2024"},{"key":"2026031115370956600_bib19","doi-asserted-by":"publisher","first-page":"5464","DOI":"10.18653\/v1\/2023.findings-emnlp.363","article-title":"GPT-4 as an effective zero-shot evaluator for scientific figure captions","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2023","author":"Hsu","year":"2023"},{"key":"2026031115370956600_bib20","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.inlg-main.6","article-title":"Summaries as captions: Generating figure captions for scientific documents with automated text summarization","author":"Huang","year":"2023"},{"key":"2026031115370956600_bib21","article-title":"Mistral 7b","author":"Jiang","year":"2023","journal-title":"arXiv preprint arXiv:2310.06825"},{"key":"2026031115370956600_bib22","article-title":"Chart-to-text: A large-scale benchmark for chart summarization","author":"Kantharaj","year":"2022","journal-title":"arXiv preprint arXiv:2203.06486"},{"key":"2026031115370956600_bib23","article-title":"ACL-fig: A dataset for scientific figure classification","author":"Karishma","year":"2023","journal-title":"arXiv preprint arXiv:2301.12293"},{"key":"2026031115370956600_bib24","doi-asserted-by":"publisher","first-page":"3464","DOI":"10.18653\/v1\/2022.naacl-main.254","article-title":"Transparent human evaluation for image captioning","volume-title":"Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Kasai","year":"2022"},{"key":"2026031115370956600_bib25","article-title":"Multi-LLM collaborative caption generation in scientific documents","author":"Kim","year":"2025","journal-title":"arXiv preprint arXiv:2501.02552"},{"key":"2026031115370956600_bib26","article-title":"The semantic 
scholar open data platform","author":"Kinney","year":"2023","journal-title":"ArXiv"},{"issue":"5","key":"2026031115370956600_bib27","doi-asserted-by":"publisher","first-page":"340","DOI":"10.1002\/(SICI)1097-4571(199506)46:5<340::AID-ASI5>3.0.CO;2-S","article-title":"Multimedia and comprehension: The relationship among text, animation, and captions","volume":"46","author":"Large","year":"1995","journal-title":"Journal of the American Society for Information Science"},{"key":"2026031115370956600_bib28","article-title":"BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models","author":"Li","year":"2023"},{"key":"2026031115370956600_bib29","article-title":"SciGraphQA: A large-scale synthetic multi-turn question-answering dataset for scientific graphs","author":"Li","year":"2023","journal-title":"arXiv preprint arXiv:2308.03349"},{"key":"2026031115370956600_bib30","first-page":"74","article-title":"ROUGE: A package for automatic evaluation of summaries","volume-title":"Text Summarization Branches Out","author":"Lin","year":"2004"},{"key":"2026031115370956600_bib31","article-title":"Visual instruction tuning","author":"Liu","year":"2023"},{"key":"2026031115370956600_bib32","doi-asserted-by":"publisher","first-page":"2890","DOI":"10.18653\/v1\/2022.acl-long.207","article-title":"BRIO: Bringing order to abstractive summarization","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Liu","year":"2022"},{"key":"2026031115370956600_bib33","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.emnlp-main.906","article-title":"UniChart: A universal vision-language pretrained model for chart comprehension and reasoning","author":"Masry","year":"2023"},{"key":"2026031115370956600_bib34","article-title":"Understanding how paper writers use AI-generated captions in figure caption writing","volume-title":"2nd AI4Research Workshop: Towards a 
Knowledge-grounded Scientific Research Lifecycle","author":"Ho","year":"2025"},{"issue":"2","key":"2026031115370956600_bib35","doi-asserted-by":"publisher","first-page":"227","DOI":"10.1177\/002246698301700214","article-title":"Deaf students\u2019 learning from captioned instruction: The relationship between the visual and caption display","volume":"17","author":"Nugent","year":"1983","journal-title":"The Journal of Special Education"},{"key":"2026031115370956600_bib36","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.inlg-1.20","article-title":"Chart-to-text: Generating natural language descriptions for charts by adapting the transformer model","author":"Obeid","year":"2020","journal-title":"arXiv preprint arXiv:2010.09142"},{"key":"2026031115370956600_bib37","doi-asserted-by":"publisher","first-page":"311","DOI":"10.3115\/1073083.1073135","article-title":"BLEU: A method for automatic evaluation of machine translation","volume-title":"Proceedings of the 40th annual meeting of the Association for Computational Linguistics","author":"Papineni","year":"2002"},{"key":"2026031115370956600_bib38","article-title":"Human evaluation of text-to-image models on a multi-task benchmark","author":"Petsiuk","year":"2022"},{"key":"2026031115370956600_bib39","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.emnlp-main.240","article-title":"Investigating efficiently extending transformers for long input summarization","author":"Phang","year":"2022"},{"key":"2026031115370956600_bib40","doi-asserted-by":"publisher","first-page":"2792","DOI":"10.1145\/3442381.3449923","article-title":"Generating accurate caption units for figure captioning","volume-title":"Proceedings of the Web Conference 2021","author":"Qian","year":"2021"},{"key":"2026031115370956600_bib41","article-title":"ChartSumm: A comprehensive benchmark for automatic chart summarization of long and short summaries","author":"Rahman","year":"2023","journal-title":"arXiv preprint 
arXiv:2304.13620"},{"key":"2026031115370956600_bib42","doi-asserted-by":"publisher","first-page":"10348","DOI":"10.18653\/v1\/2023.emnlp-main.640","article-title":"The ACL OCL corpus: Advancing open science in computational linguistics","volume-title":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing","author":"Rohatgi","year":"2023"},{"key":"2026031115370956600_bib43","doi-asserted-by":"publisher","first-page":"664","DOI":"10.1007\/978-3-319-46478-7_41","article-title":"Figureseer: Parsing result-figures in research papers","volume-title":"Computer Vision\u2013ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11\u201314, 2016, Proceedings, Part VII 14","author":"Siegel","year":"2016"},{"issue":"2","key":"2026031115370956600_bib44","doi-asserted-by":"publisher","first-page":"024022","DOI":"10.1103\/PhysRevD.95.024022","article-title":"Gravitational wave transient signal emission via ekman pumping in neutron stars during post-glitch relaxation phase","volume":"95","author":"Singh","year":"2017","journal-title":"Physical Review D"},{"key":"2026031115370956600_bib45","doi-asserted-by":"publisher","first-page":"21","DOI":"10.18653\/v1\/W19-2303","article-title":"How to compare summarizers without target length? 
Pitfalls, solutions and re-examination of the neural summarization literature","volume-title":"Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation","author":"Sun","year":"2019"},{"key":"2026031115370956600_bib46","doi-asserted-by":"publisher","first-page":"4560","DOI":"10.1109\/WACV57701.2024.00450","article-title":"SciOL and MuLMS-Img: Introducing a large-scale multimodal scientific dataset and models for image-text tasks in the scientific domain","volume-title":"Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision (WACV)","author":"Tarsi","year":"2024"},{"key":"2026031115370956600_bib47","article-title":"Llama 2: Open foundation and fine-tuned chat models","author":"Touvron","year":"2023","journal-title":"arXiv preprint arXiv:2307.09288"},{"key":"2026031115370956600_bib48","doi-asserted-by":"publisher","DOI":"10.1109\/ICDAR.2017.93","article-title":"A data driven approach for compound figure separation using convolutional neural networks","volume-title":"The IAPR International Conference on Document Analysis and Recognition (ICDAR)","author":"Tsutsui","year":"2017"},{"key":"2026031115370956600_bib49","article-title":"ChartX & chartVLM: A versatile benchmark and foundation model for complicated chart reasoning","author":"Xia","year":"2024","journal-title":"arXiv preprint arXiv:2402.12185"},{"key":"2026031115370956600_bib50","article-title":"EvaLAI: Towards better evaluation systems for AI agents","author":"Yadav","year":"2019"},{"key":"2026031115370956600_bib51","article-title":"Scicap+: A knowledge augmented dataset to study the challenges of scientific figure captioning","author":"Yang","year":"2023","journal-title":"arXiv preprint arXiv:2306.03491"},{"key":"2026031115370956600_bib52","article-title":"mPLUG-Owl2: Revolutionizing multi-modal large language model with modality collaboration","author":"Ye","year":"2023","journal-title":"arXiv preprint 
arXiv:2311.04257"},{"key":"2026031115370956600_bib53","article-title":"A solution to the 1st scientific figure captioning (scicap) challenge","author":"Jun","year":"2023","journal-title":"arXiv preprint"},{"key":"2026031115370956600_bib54","article-title":"MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI","author":"Yue","year":"2023","journal-title":"arXiv preprint arXiv:2311.16502"},{"key":"2026031115370956600_bib55","first-page":"11328","article-title":"Pegasus: Pre-training with extracted gap-sentences for abstractive summarization","volume-title":"International Conference on Machine Learning","author":"Zhang","year":"2020"},{"key":"2026031115370956600_bib56","article-title":"OPT: Open pre-trained transformer language models","author":"Zhang","year":"2022","journal-title":"arXiv preprint arXiv:2205.01068"},{"key":"2026031115370956600_bib57","article-title":"MiniGPT-4: Enhancing vision-language understanding with advanced large language models","author":"Zhu","year":"2023","journal-title":"arXiv preprint arXiv:2304.10592"}],"container-title":["Transactions of the Association for Computational 
Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/TACL.a.653\/2587241\/tacl.a.653.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/TACL.a.653\/2587241\/tacl.a.653.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,11]],"date-time":"2026-03-11T19:37:14Z","timestamp":1773257834000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/TACL.a.653\/135736\/Do-Large-Multimodal-Models-Solve-Caption"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,3,5]]},"references-count":57,"URL":"https:\/\/doi.org\/10.1162\/tacl.a.653","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,3,5]]}}}