{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,18]],"date-time":"2026-05-18T16:29:39Z","timestamp":1779121779759,"version":"3.51.4"},"reference-count":43,"publisher":"Cambridge University Press (CUP)","issue":"2","license":[{"start":{"date-parts":[[2009,4,6]],"date-time":"2009-04-06T00:00:00Z","timestamp":1238976000000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/www.cambridge.org\/core\/terms"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2010,4]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Pyramid annotation makes it possible to evaluate quantitatively and qualitatively the content of machine-generated (or human) summaries. Evaluation methods must prove themselves against the same measuring stick \u2013 evaluation \u2013 as other research methods. First, a formal assessment of pyramid data from the 2003 Document Understanding Conference (DUC) is presented; this addresses whether the form of annotation is reliable and whether score results are consistent across annotators. A combination of interannotator reliability measures of the two manual annotation phases (pyramid creation and annotation of system peer summaries against pyramid models), and significance tests of the similarity of system scores from distinct annotations, produces highly reliable results. The most rigorous test consists of a comparison of peer system rankings produced from two independent sets of pyramid and peer annotations, which produce essentially the same rankings. Three years of DUC data (2003, 2005, 2006) are used to assess the reliability of the method across distinct evaluation settings: distinct systems, document sets, summary lengths, and numbers of model summaries. This functional assessment addresses the method's ability to discriminate systems across years. Results indicate that the statistical power of the method is more than sufficient to identify statistically significant differences among systems, and that the statistical power varies little across the 3 years.<\/jats:p>","DOI":"10.1017\/s1351324909005051","type":"journal-article","created":{"date-parts":[[2009,4,6]],"date-time":"2009-04-06T08:29:15Z","timestamp":1239006555000},"page":"107-131","source":"Crossref","is-referenced-by-count":5,"title":["Formal and functional assessment of the pyramid method for summary content evaluation"],"prefix":"10.1017","volume":"16","author":[{"given":"REBECCA J.","family":"PASSONNEAU","sequence":"first","affiliation":[]}],"member":"56","published-online":{"date-parts":[[2009,4,6]]},"reference":[{"key":"S1351324909005051_ref41","unstructured":"Turian J. , Shen L. , and Melamed I. D. 2003. Evaluation of machine translation and its evaluation. In Proceedings of MT Summit IX, pp. 386\u201393. New Orleans, LA, September 23\u201327."},{"key":"S1351324909005051_ref39","unstructured":"Sparck Jones K. , and Galliers J. R. 1993. Evaluating natural language processing systems. Technical Report 291, Computer Laboratory, University of Cambridge."},{"key":"S1351324909005051_ref26","doi-asserted-by":"publisher","DOI":"10.1145\/1233912.1233913"},{"key":"S1351324909005051_ref17","unstructured":"Hovy E. , Lin C.-Y. , and Zhou L. 2005. Evaluating DUC 2005 using basic elements. In Proceedings of the 2005 Document Understanding Workshop, Vancouver, BC, October 9\u201310."},{"key":"S1351324909005051_ref2","volume-title":"Language","author":"Bloomfield","year":"1933"},{"key":"S1351324909005051_ref36","unstructured":"Passonneau R. , Nenkova A. , McKeown K. , and Sigelman S. 2005. Applying the pyramid method in DUC 2005. In Proceedings of the 2005 Document Understanding Conference, Vancouver, BC, October 9\u201310."},{"key":"S1351324909005051_ref34","unstructured":"Passonneau R. , McKeown K. , and Sigelman S. 2006. Applying the pyramid method in the 2006 Document Understanding Conference. In Proceedings of the 2006 Document Understanding Conference, Brooklyn, NY, June 8\u20139."},{"key":"S1351324909005051_ref31","unstructured":"Passonneau R. May 26\u201328, 2004. Computing reliability for coreference annotation. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal."},{"key":"S1351324909005051_ref37","doi-asserted-by":"crossref","unstructured":"Radev D. R. , Teufel S. , Saggion H. , Lam W. , Blitzer J. , Qi H. , Celebi A. , Liu D. , and Drabek E. 2003. Evaluation challenges in large-scale multi-document summarization: the MEAD project. In Proceedings of the 41st Association for Computational Linguistics, Sapporo, Japan, pp. 375\u201382. Association for Computational Linguistics. Morristown, NJ, USA.","DOI":"10.3115\/1075096.1075144"},{"key":"S1351324909005051_ref24","unstructured":"Nenkova A. 2005. Automatic text summarization of newswire: Lessons learned from the Document Understanding Conference. In Proceedings of the 20th National Conference on Artificial Intelligence (AAAI 2005), Pittsburgh, PA."},{"key":"S1351324909005051_ref10","doi-asserted-by":"publisher","DOI":"10.2307\/1932409"},{"key":"S1351324909005051_ref9","unstructured":"Dang H. T. 2007. Overview of DUC 2006. In Proceedings of the 2006 Document Understanding Conference, Brooklyn, NY, June 8\u20139, 2006."},{"key":"S1351324909005051_ref28","doi-asserted-by":"publisher","DOI":"10.1080\/00107510500052444"},{"key":"S1351324909005051_ref15","volume-title":"Recent Advances in Natural Language Processing (RANLP)","author":"Harnly","year":"2005"},{"key":"S1351324909005051_ref27","doi-asserted-by":"publisher","DOI":"10.1145\/1148170.1148269"},{"key":"S1351324909005051_ref8","volume-title":"Empirical Methods for Artificial Intelligence","author":"Cohen","year":"1995"},{"key":"S1351324909005051_ref6","doi-asserted-by":"publisher","DOI":"10.1002\/1532-2890(2000)9999:9999<::AID-ASI1073>3.0.CO;2-5"},{"key":"S1351324909005051_ref29","doi-asserted-by":"crossref","unstructured":"Papineni K. , Roukos S. , Ward T. , and Jing W.-Z. 2001. Bleu: a method for automatic evaluation of machine translation. Technical Report RC22176, IBM Research Division, Yorktown Heights, NY.","DOI":"10.3115\/1073083.1073135"},{"key":"S1351324909005051_ref4","unstructured":"Carlson L. , Conroy J. M. , Marcu D. , O'Leary D. P. , Okurowski M. E. , Taylor A. , and Wong W. 2001. An empirical study of the relation between abstracts, extracts, and the discourse structure of texts. In Proceedings of the Document Understanding Workshop (DUC-2001), New Orleans, LA, September 13\u201314."},{"key":"S1351324909005051_ref14","doi-asserted-by":"publisher","DOI":"10.1162\/coli.2006.32.2.263"},{"key":"S1351324909005051_ref35","unstructured":"Passonneau R. , and Nenkova A. 2003. Evaluating content selection in human- or machine-generated summaries: the pyramid scoring method. Technical Report CUCS-025-03, Columbia University, New York, NY."},{"key":"S1351324909005051_ref25","unstructured":"Nenkova A. , and Passonneau R. J. 2004. Evaluating content selection in summarization: The pyramid method. In Proceedings of the Joint Annual Meeting of Human Language Technology (HLT) and the North American Chapter of the Association for Computational Linguistics (NACL), Boston, MA."},{"key":"S1351324909005051_ref23","doi-asserted-by":"publisher","DOI":"10.1017\/S1351324901002741"},{"key":"S1351324909005051_ref40","unstructured":"Teufel S. , and van Halteren H. 2004. Evaluating information content by factoid analysis: human annotation and stability. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain."},{"key":"S1351324909005051_ref1","unstructured":"Artstein R. , and Poesio M 2005. Kappa cubed = alpha (or beta). Technical Report, NLE Technote 2005-01, University of Essex, Essex."},{"key":"S1351324909005051_ref30","unstructured":"Passonneau R. 1997. Applying reliability metrics to co-reference annotation. Technical Report CUCS-025-03, Columbia University, Department of Computer Science."},{"key":"S1351324909005051_ref3","first-page":"249","article-title":"Assessing agreement on classification tasks: the kappa statistic","volume":"22","author":"Carletta","year":"1996","journal-title":"Computational Linguistics"},{"key":"S1351324909005051_ref16","first-page":"287","article-title":"Towards a tool for the subjective assessment of speech system interfaces (SASSI)","volume":"6","author":"Hone","year":"2001","journal-title":"Natural Language Engineering (Special issue on Best Practice in Spoken Dialogue Systems"},{"key":"S1351324909005051_ref5","first-page":"410","article-title":"Evaluating message understanding systems: an analysis of the Third Message Understanding Conference","volume":"19","author":"Chinchor","year":"1993","journal-title":"Computational Linguistics"},{"key":"S1351324909005051_ref38","doi-asserted-by":"publisher","DOI":"10.1002\/asi.5090120210"},{"key":"S1351324909005051_ref11","unstructured":"Doddington G. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the ARPA Workshop on Human Language Technology, San Diego, CA, pp. 128\u201332."},{"key":"S1351324909005051_ref18","unstructured":"Hovy E. , Lin C.-Y. , Zhou L. , and Fukumoto J. 2006. Automated summarization evaluation with basic elements. In Proceedings of the Fifth Conference on Language Resources and Evaluation (LREC), Genoa, Italy, May 24\u201326."},{"key":"S1351324909005051_ref7","doi-asserted-by":"publisher","DOI":"10.1177\/001316446002000104"},{"key":"S1351324909005051_ref13","unstructured":"Fuentes M. , Gonzalez E. , Ferres D. , and Rodriguez H. 2005. QASUM-TALP at DUC 2005 automatically evaluated with a pyramid based metric. In Proceedings of the 2005 Document Understanding Conference, Vancouver, BC, October 9\u201310."},{"key":"S1351324909005051_ref19","first-page":"223","article-title":"Nouvelles recherches sur la distribution florale","volume":"44","author":"Jaccard","year":"1908","journal-title":"Bulletin de la Societe Vaudoise des Sciences Naturelles"},{"key":"S1351324909005051_ref22","unstructured":"Lin C.-Y. , and Hovy Eduard . 2002. Manual and automatic evaluation of summaries. In Proceedings of the Workshop on Summarization, Association for Computational Linguistics, Philadelphia, PA, July 11\u201312."},{"key":"S1351324909005051_ref32","unstructured":"Passonneau R. May 24\u201326, 2006. Measuring agreement on set-valued items (MASI) for semantic and pragmatic annotation. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Genoa, Italy."},{"key":"S1351324909005051_ref42","unstructured":"van Halteren H. , and Teufel S. 2003. Examining the consensus between human summaries: initial experiments with factoid analysis. In Proceedings of the Document Understanding Conference Workshop, Edmonton, Canada, May 31\u2013June 1."},{"key":"S1351324909005051_ref21","volume-title":"Content Analysis: An Introduction to its Methodology","author":"Krippendorff","year":"1980"},{"key":"S1351324909005051_ref12","first-page":"1289","article-title":"An extensive empirical study of feature selection metrics for text classification","volume":"3","author":"Forman","year":"2003","journal-title":"Journal of Machine Learning Research"},{"key":"S1351324909005051_ref20","first-page":"60","volume-title":"AAAI Intelligent Text Summarization Workshop","author":"Jing","year":"1998"},{"key":"S1351324909005051_ref43","volume-title":"Human Behavior and the Principle of Least Effort","author":"Zipf","year":"1949"},{"key":"S1351324909005051_ref33","unstructured":"Passonneau R. , Goodkind A. , and Levy E. 2007. Annotation of children's oral narrations: modeling emergent narrative skills for computational applications. In Proceedings of the 20th Annual Meeting of the Florida Artificial Intelligence Research Society (FLAIRS-20), Key West, FL."}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S1351324909005051","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2019,4,28]],"date-time":"2019-04-28T20:09:15Z","timestamp":1556482155000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S1351324909005051\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2009,4,6]]},"references-count":43,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2010,4]]}},"alternative-id":["S1351324909005051"],"URL":"https:\/\/doi.org\/10.1017\/s1351324909005051","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"value":"1351-3249","type":"print"},{"value":"1469-8110","type":"electronic"}],"subject":[],"published":{"date-parts":[[2009,4,6]]}}}