{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,13]],"date-time":"2026-04-13T18:53:05Z","timestamp":1776106385009,"version":"3.50.1"},"reference-count":64,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2025,1,27]],"date-time":"2025-01-27T00:00:00Z","timestamp":1737936000000},"content-version":"vor","delay-in-days":386,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,1,7]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Experts in various fields routinely perform methodical writing tasks to plan, organize, and report their work. From a clinician writing a differential diagnosis for a patient, to a teacher writing a lesson plan for students, these tasks are pervasive, requiring to methodically generate structured long-form output for a given input. We develop a typology of methodical tasks structured in the form of a task objective, procedure, input, and output, and introduce DoLoMiTes, a novel benchmark with specifications for 519 such tasks elicited from hundreds of experts from across 25 fields. Our benchmark further contains specific instantiations of methodical tasks with concrete input and output examples (1,857 in total) which we obtain by collecting expert revisions of up to 10 model-generated examples of each task. We use these examples to evaluate contemporary language models, highlighting that automating methodical tasks is a challenging long-form generation problem, as it requires performing complex inferences, while drawing upon the given context as well as domain knowledge. Our dataset is available at https:\/\/dolomites-benchmark.github.io\/.<\/jats:p>","DOI":"10.1162\/tacl_a_00727","type":"journal-article","created":{"date-parts":[[2025,1,10]],"date-time":"2025-01-10T19:08:41Z","timestamp":1736536121000},"page":"1-29","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":1,"title":["<scp>Dolomites<\/scp>: Domain-Specific Long-Form Methodical\n                    Tasks"],"prefix":"10.1162","volume":"13","author":[{"given":"Chaitanya","family":"Malaviya","sequence":"first","affiliation":[{"name":"University of Pennsylvania, USA. cmalaviy@seas.upenn.edu"}]},{"given":"Priyanka","family":"Agrawal","sequence":"additional","affiliation":[{"name":"Google DeepMind, USA"}]},{"given":"Kuzman","family":"Ganchev","sequence":"additional","affiliation":[{"name":"Google DeepMind, USA"}]},{"given":"Pranesh","family":"Srinivasan","sequence":"additional","affiliation":[{"name":"Google DeepMind, USA"}]},{"given":"Fantine","family":"Huot","sequence":"additional","affiliation":[{"name":"Google DeepMind, USA"}]},{"given":"Jonathan","family":"Berant","sequence":"additional","affiliation":[{"name":"Google DeepMind, USA"}]},{"given":"Mark","family":"Yatskar","sequence":"additional","affiliation":[{"name":"University of Pennsylvania, USA"}]},{"given":"Dipanjan","family":"Das","sequence":"additional","affiliation":[{"name":"Google DeepMind, USA"}]},{"given":"Mirella","family":"Lapata","sequence":"additional","affiliation":[{"name":"Google DeepMind, UK"}]},{"given":"Chris","family":"Alberti","sequence":"additional","affiliation":[{"name":"Google DeepMind, USA"}]}],"member":"281","published-online":{"date-parts":[[2024,1,7]]},"reference":[{"key":"2025012714411224800_bib1","article-title":"The claude 3 model family: Opus, sonnet,\n                        haiku","author":"Anthropic"},{"issue":"5","key":"2025012714411224800_bib2","doi-asserted-by":"publisher","first-page":"277","DOI":"10.1038\/s42254-023-00581-4","article-title":"Science in the age of large language\n                        models","volume":"5","author":"Birhane","year":"2023","journal-title":"Nature Reviews Physics"},{"key":"2025012714411224800_bib3","article-title":"How novelists use generative language\n                        models: An exploratory user study.","volume-title":"HAI-GEN+\n                        user2agent@ IUI","author":"Calderwood","year":"2020"},{"key":"2025012714411224800_bib4","doi-asserted-by":"publisher","first-page":"15607","DOI":"10.18653\/v1\/2023.acl-long.870","article-title":"Can large language models be an alternative to human\n                        evaluations?","volume-title":"Proceedings of the 61st Annual\n                        Meeting of the Association for Computational Linguistics (Volume 1: Long\n                        Papers)","author":"Chiang","year":"2023"},{"issue":"240","key":"2025012714411224800_bib5","first-page":"1","article-title":"Palm: Scaling language modeling with\n                        pathways","volume":"24","author":"Chowdhery","year":"2023","journal-title":"Journal of Machine Learning\n                        Research"},{"key":"2025012714411224800_bib6","article-title":"Introducing Command R+:\n                        A scalable LLM built for business","author":"Cohere","year":"2024"},{"issue":"24-013","key":"2025012714411224800_bib7","doi-asserted-by":"publisher","DOI":"10.2139\/ssrn.4573321","article-title":"Navigating the jagged technological\n                        frontier: Field experimental evidence of the effects of ai on knowledge\n                        worker productivity and quality","author":"Dell\u2019Acqua","year":"2023","journal-title":"Harvard Business\n                        School Technology & Operations Mgt. Unit Working Paper"},{"issue":"11","key":"2025012714411224800_bib8","doi-asserted-by":"publisher","first-page":"688","DOI":"10.1038\/s44159-023-00241-5","article-title":"Using large language models in\n                        psychology","volume":"2","author":"Demszky","year":"2023","journal-title":"Nature Reviews Psychology"},{"key":"2025012714411224800_bib9","doi-asserted-by":"publisher","first-page":"1286","DOI":"10.18653\/v1\/2021.emnlp-main.98","article-title":"Documenting large webtext corpora: A case\n                        study on the colossal clean crawled corpus","volume-title":"Proceedings of the 2021 Conference on Empirical Methods in Natural\n                        Language Processing","author":"Dodge","year":"2021"},{"key":"2025012714411224800_bib10","article-title":"What\u2019s in my big data?","volume-title":"The Twelfth International Conference on Learning\n                        Representations","author":"Elazar","year":"2024"},{"issue":"6702","key":"2025012714411224800_bib11","doi-asserted-by":"publisher","first-page":"1306","DOI":"10.1126\/science.adj0998","article-title":"Gpts are gpts: Labor market impact potential of\n                        llms","volume":"384","author":"Eloundou","year":"2024","journal-title":"Science"},{"key":"2025012714411224800_bib12","first-page":"13","article-title":"Augmenting human intellect: A conceptual\n                        framework","volume-title":"Augmented Education in the Global\n                        Age","author":"Engelbart","year":"2023"},{"key":"2025012714411224800_bib13","doi-asserted-by":"publisher","first-page":"11397","DOI":"10.18653\/v1\/2024.acl-long.615","article-title":"Learning to plan and generate text with\n                        citations","volume-title":"Proceedings of the 62nd Annual Meeting\n                        of the Association for Computational Linguistics (Volume 1: Long\n                        Papers)","author":"Fierro","year":"2024"},{"issue":"20","key":"2025012714411224800_bib14","doi-asserted-by":"publisher","first-page":"22021","DOI":"10.1609\/aaai.v38i20.30205","article-title":"Medalign: A clinician-generated dataset for instruction\n                        following with electronic medical records","volume":"38","author":"Fleming","year":"2024","journal-title":"Proceedings of the AAAI Conference on Artificial\n                        Intelligence"},{"key":"2025012714411224800_bib15","doi-asserted-by":"publisher","DOI":"10.2139\/ssrn.4027030","article-title":"Natural language processing in legal\n                        tech","author":"Frankenreiter","year":"2022","journal-title":"Legal Tech and the Future of Civil Justice\n                        (David Engstrom ed.) Forthcoming"},{"key":"2025012714411224800_bib16","article-title":"Challenges in evaluating AI\n                        systems","author":"Ganguli","year":"2023"},{"key":"2025012714411224800_bib17","doi-asserted-by":"publisher","first-page":"1002","DOI":"10.1145\/3532106.3533533","article-title":"Sparks: Inspiration for science writing\n                        using language models","volume-title":"Proceedings of the 2022\n                        ACM Designing Interactive Systems Conference","author":"Gero","year":"2022"},{"key":"2025012714411224800_bib18","doi-asserted-by":"publisher","first-page":"15789","DOI":"10.18653\/v1\/2024.acl-long.841","article-title":"OLMo: Accelerating the science of language\n                        models","volume-title":"Proceedings of the 62nd Annual Meeting of\n                        the Association for Computational Linguistics (Volume 1: Long\n                        Papers)","author":"Groeneveld","year":"2024"},{"key":"2025012714411224800_bib19","doi-asserted-by":"publisher","DOI":"10.2139\/ssrn.4583531","article-title":"Legalbench: A collaboratively built benchmark for measuring\n                        legal reasoning in large language models","volume":"36","author":"Guha","year":"2024","journal-title":"Advances\n                        in Neural Information Processing Systems"},{"key":"2025012714411224800_bib20","doi-asserted-by":"publisher","first-page":"211","DOI":"10.1162\/tacl_a_00362","article-title":"Wikiasp: A dataset for multi-domain\n                        aspect-based summarization","volume":"9","author":"Hayashi","year":"2021","journal-title":"Transactions of the\n                        Association for Computational Linguistics"},{"key":"2025012714411224800_bib21","article-title":"Measuring massive multitask language\n                        understanding","author":"Hendrycks","year":"2021","journal-title":"Proceedings of the International\n                        Conference on Learning Representations (ICLR)"},{"key":"2025012714411224800_bib22","doi-asserted-by":"publisher","first-page":"3905","DOI":"10.18653\/v1\/2022.naacl-main.287","article-title":"TRUE: Re-evaluating factual consistency\n                        evaluation","volume-title":"Proceedings of the 2022 Conference of\n                        the North American Chapter of the Association for Computational Linguistics:\n                        Human Language Technologies","author":"Or","year":"2022"},{"key":"2025012714411224800_bib23","article-title":"Mixtral of experts","author":"Jiang","year":"2024","journal-title":"arXiv preprint arXiv:2401.04088v1"},{"issue":"14","key":"2025012714411224800_bib24","doi-asserted-by":"publisher","DOI":"10.3390\/app11146421","article-title":"What disease does this patient have? A\n                        large-scale open domain question answering dataset from medical\n                        exams","volume":"11","author":"Di","year":"2021","journal-title":"Applied Sciences"},{"key":"2025012714411224800_bib25","doi-asserted-by":"publisher","first-page":"2567","DOI":"10.18653\/v1\/D19-1259","article-title":"Pubmedqa: A dataset for biomedical\n                        research question answering","volume-title":"Proceedings of the\n                        2019 Conference on Empirical Methods in Natural Language Processing and the\n                        9th International Joint Conference on Natural Language Processing\n                        (EMNLP-IJCNLP)","author":"Jin","year":"2019"},{"key":"2025012714411224800_bib26","doi-asserted-by":"publisher","DOI":"10.21236\/ADA006655","article-title":"Derivation of new readability formulas\n                        (automated readability index, fog count and flesch reading ease formula) for\n                        navy enlisted personnel","author":"Kincaid","year":"1975"},{"key":"2025012714411224800_bib27","doi-asserted-by":"publisher","first-page":"4940","DOI":"10.18653\/v1\/2021.naacl-main.393","article-title":"Hurdles to progress in long-form question\n                        answering","volume-title":"Proceedings of the 2021 Conference of\n                        the North American Chapter of the Association for Computational Linguistics:\n                        Human Language Technologies","author":"Krishna","year":"2021"},{"key":"2025012714411224800_bib28","doi-asserted-by":"publisher","first-page":"452","DOI":"10.1162\/tacl_a_00276","article-title":"Natural questions: A benchmark for\n                        question answering research","volume":"7","author":"Kwiatkowski","year":"2019","journal-title":"Transactions of the\n                        Association for Computational Linguistics"},{"key":"2025012714411224800_bib29","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3491102.3502030","article-title":"Coauthor: Designing a human-ai collaborative writing dataset\n                        for exploring language model capabilities","volume-title":"Proceedings of the 2022 CHI conference on human factors in computing\n                        systems","author":"Lee","year":"2022"},{"issue":"13","key":"2025012714411224800_bib30","doi-asserted-by":"publisher","first-page":"1233","DOI":"10.1056\/NEJMsr2214184","article-title":"Benefits, limits, and risks of gpt-4 as an ai\n                        chatbot for medicine","volume":"388","author":"Lee","year":"2023","journal-title":"New England Journal of\n                        Medicine"},{"key":"2025012714411224800_bib31","doi-asserted-by":"publisher","DOI":"10.1145\/3613904.3642625","article-title":"The value, benefits, and concerns of generative ai-powered\n                        assistance in writing","volume-title":"Proceedings of the CHI\n                        Conference on Human Factors in Computing Systems","author":"Li","year":"2024"},{"key":"2025012714411224800_bib32","article-title":"Wildbench: Benchmarking llms with challenging tasks from real\n                        users in the wild","author":"Lin","year":"2024"},{"key":"2025012714411224800_bib33","first-page":"74","article-title":"ROUGE: A package for automatic evaluation of\n                        summaries","volume-title":"Text Summarization Branches\n                        Out","author":"Lin","year":"2004"},{"key":"2025012714411224800_bib34","doi-asserted-by":"publisher","first-page":"2122","DOI":"10.18653\/v1\/D16-1230","article-title":"How NOT to evaluate your dialogue system:\n                        An empirical study of unsupervised evaluation metrics for dialogue response\n                        generation","volume-title":"Proceedings of the 2016 Conference on\n                        Empirical Methods in Natural Language Processing","author":"Liu","year":"2016"},{"key":"2025012714411224800_bib35","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.naacl-long.167","article-title":"ExpertQA: Expert-curated questions and attributed\n                        answers","volume-title":"2024 Annual Conference of the North\n                        American Chapter of the Association for Computational Linguistics","author":"Malaviya","year":"2024"},{"key":"2025012714411224800_bib36","article-title":"An overview of bard: An early experiment with\n                        generative ai","volume":"2","author":"Manyika","year":"2023","journal-title":"AI. Google Static Documents"},{"key":"2025012714411224800_bib37","article-title":"Au\n                        large","author":"Mistral","year":"2024"},{"key":"2025012714411224800_bib38","doi-asserted-by":"publisher","DOI":"10.2139\/ssrn.4391243","article-title":"Using ai to implement effective teaching\n                        strategies in classrooms: Five strategies, including\n                    prompts","author":"Mollick","year":"2023","journal-title":"Including Prompts (March 17, 2023)"},{"key":"2025012714411224800_bib39","doi-asserted-by":"publisher","first-page":"974","DOI":"10.1162\/tacl_a_00583","article-title":"Conditional generation with a\n                        question-answering blueprint","volume":"11","author":"Narayan","year":"2023","journal-title":"Transactions of the\n                        Association for Computational Linguistics"},{"key":"2025012714411224800_bib40","article-title":"Ms marco: A human generated machine\n                        reading comprehension dataset","volume-title":"CoCo@NIPS","author":"Nguyen","year":"2016"},{"key":"2025012714411224800_bib41","doi-asserted-by":"publisher","first-page":"3016","DOI":"10.18653\/v1\/2023.findings-emnlp.200","article-title":"LEXTREME: A multi-lingual and multi-task\n                        benchmark for the legal domain","volume-title":"Findings of the\n                        Association for Computational Linguistics: EMNLP 2023","author":"Niklaus","year":"2023"},{"key":"2025012714411224800_bib42","doi-asserted-by":"publisher","first-page":"2241","DOI":"10.18653\/v1\/D17-1238","article-title":"Why we need new evaluation metrics for\n                        NLG","volume-title":"Proceedings of the 2017 Conference on\n                        Empirical Methods in Natural Language Processing","author":"Novikova","year":"2017"},{"issue":"6654","key":"2025012714411224800_bib43","doi-asserted-by":"publisher","first-page":"187","DOI":"10.1126\/science.adh2586","article-title":"Experimental evidence on the productivity\n                        effects of generative artificial intelligence","volume":"381","author":"Noy","year":"2023","journal-title":"Science"},{"key":"2025012714411224800_bib44","unstructured":"OpenAI. 2023. Gpt-4 technical\n                        report. ArXiv,\n                        abs\/2303.08774v6."},{"key":"2025012714411224800_bib45","doi-asserted-by":"publisher","first-page":"2375","DOI":"10.18653\/v1\/2023.emnlp-main.146","article-title":"The shifted and the overlooked: A task-oriented investigation\n                        of user-GPT interactions","volume-title":"Proceedings of the 2023\n                        Conference on Empirical Methods in Natural Language Processing","author":"Ouyang","year":"2023"},{"key":"2025012714411224800_bib46","doi-asserted-by":"publisher","DOI":"10.1038\/d41586-023-00500-8","article-title":"How nature readers are using\n                        chatgpt","author":"Owens","year":"2023","journal-title":"Nature"},{"key":"2025012714411224800_bib47","doi-asserted-by":"publisher","first-page":"2357","DOI":"10.18653\/v1\/D18-1258","article-title":"emrQA: A large corpus for question answering on electronic\n                        medical records","volume-title":"Proceedings of the 2018\n                        Conference on Empirical Methods in Natural Language Processing","author":"Pampari","year":"2018"},{"key":"2025012714411224800_bib48","article-title":"Llm evaluators recognize and favor their own\n                        generations","author":"Panickssery","year":"2024","journal-title":"arXiv preprint\n                        arXiv:2404.13076v1"},{"issue":"140","key":"2025012714411224800_bib49","first-page":"1","article-title":"Exploring the limits of transfer learning with a unified\n                        text-to-text transformer","volume":"21","author":"Raffel","year":"2020","journal-title":"Journal of Machine\n                        Learning Research"},{"key":"2025012714411224800_bib50","article-title":"GPQA: A graduate-level google-proof\n                        q&a benchmark","volume-title":"First Conference on\n                        Language Modeling","author":"Rein","year":"2024"},{"key":"2025012714411224800_bib51","first-page":"42707","article-title":"Position: Application-driven innovation in\n                        machine learning","volume-title":"Proceedings of the 41st\n                        International Conference on Machine Learning","author":"Rolnick","year":"2024"},{"key":"2025012714411224800_bib52","doi-asserted-by":"publisher","first-page":"41","DOI":"10.18653\/v1\/E17-2007","article-title":"The limits of automatic summarisation\n                        according to ROUGE","volume-title":"Proceedings of the 15th\n                        Conference of the European Chapter of the Association for Computational\n                        Linguistics: Volume 2, Short Papers","author":"Schluter","year":"2017"},{"key":"2025012714411224800_bib53","doi-asserted-by":"publisher","first-page":"7881","DOI":"10.18653\/v1\/2020.acl-main.704","article-title":"BLEURT: Learning robust metrics for text\n                        generation","volume-title":"Proceedings of the 58th Annual\n                        Meeting of the Association for Computational Linguistics","author":"Sellam","year":"2020"},{"key":"2025012714411224800_bib54","doi-asserted-by":"publisher","first-page":"4215","DOI":"10.18653\/v1\/2023.findings-emnlp.278","article-title":"Large language models are not yet human-level evaluators for\n                        abstractive summarization","volume-title":"Findings of the\n                        Association for Computational Linguistics: EMNLP 2023","author":"Shen","year":"2023"},{"key":"2025012714411224800_bib55","first-page":"13158","article-title":"Multi-lexsum: Real-world summaries of\n                        civil rights lawsuits at multiple granularities","volume":"35","author":"Shen","year":"2022","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2025012714411224800_bib56","doi-asserted-by":"publisher","first-page":"15725","DOI":"10.18653\/v1\/2024.acl-long.840","article-title":"Dolma: An open corpus of three trillion tokens for language\n                        model pretraining research","volume-title":"Proceedings of the\n                        62nd Annual Meeting of the Association for Computational Linguistics (Volume\n                        1: Long Papers)","author":"Soldaini","year":"2024"},{"key":"2025012714411224800_bib57","article-title":"Gemini: A family of highly\n                        capable multimodal models","author":"Gemini\n                        Team","year":"2023","journal-title":"arXiv preprint\n                        arXiv:2312.11805v4"},{"key":"2025012714411224800_bib58","article-title":"Gemini 1.5: Unlocking\n                        multimodal understanding across millions of tokens of\n                        context","author":"Gemini\n                        Team","year":"2024","journal-title":"arXiv preprint\n                    arXiv:2403.05530v4"},{"issue":"1","key":"2025012714411224800_bib59","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s12859-015-0564-6","article-title":"An overview of the bioasq large-scale\n                        biomedical semantic indexing and question answering\n                        competition","volume":"16","author":"Tsatsaronis","year":"2015","journal-title":"BMC Bioinformatics"},{"issue":"7972","key":"2025012714411224800_bib60","doi-asserted-by":"publisher","first-page":"47","DOI":"10.1038\/s41586-023-06221-2","article-title":"Scientific discovery in the age of artificial\n                        intelligence","volume":"620","author":"Wang","year":"2023","journal-title":"Nature"},{"key":"2025012714411224800_bib61","doi-asserted-by":"publisher","first-page":"9440","DOI":"10.18653\/v1\/2024.acl-long.511","article-title":"Large language models are not fair\n                    evaluators","volume-title":"Proceedings of the 62nd Annual Meeting of\n                        the Association for Computational Linguistics (Volume 1: Long\n                        Papers)","author":"Wang","year":"2024"},{"key":"2025012714411224800_bib62","doi-asserted-by":"publisher","first-page":"680","DOI":"10.18653\/v1\/2024.acl-long.40","article-title":"FOFO: A benchmark to evaluate LLMs\u2019\n                        format-following capability","volume-title":"Proceedings of the\n                        62nd Annual Meeting of the Association for Computational Linguistics (Volume\n                        1: Long Papers)","author":"Xia","year":"2024"},{"key":"2025012714411224800_bib63","article-title":"(inthe) wildchat: 570k chatgpt interaction logs in the\n                        wild","volume-title":"The Twelfth International Conference on\n                        Learning Representations","author":"Zhao","year":"2023"},{"key":"2025012714411224800_bib64","article-title":"Judging LLM-as-a-judge with MT-bench and\n                        chatbot arena","volume-title":"Thirty-seventh Conference on\n                        Neural Information Processing Systems Datasets and Benchmarks\n                    Track","author":"Zheng","year":"2023"}],"container-title":["Transactions of the Association for Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00727\/2499788\/tacl_a_00727.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00727\/2499788\/tacl_a_00727.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,1,27]],"date-time":"2025-01-27T14:41:24Z","timestamp":1737988884000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00727\/127459\/Dolomites-Domain-Specific-Long-Form-Methodical"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,1,7]]},"references-count":64,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00727","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2025]]},"published":{"date-parts":[[2024,1,7]]}}}