{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,15]],"date-time":"2026-07-15T11:52:42Z","timestamp":1784116362143,"version":"3.55.0"},"reference-count":21,"publisher":"National Academy of Sciences","issue":"43","content-domain":{"domain":["www.pnas.org"],"crossmark-restriction":true},"short-container-title":["Proc. Natl. Acad. Sci. U.S.A."],"published-print":{"date-parts":[[2011,10,25]]},"abstract":"<jats:p>Automated scene interpretation has benefited from advances in machine learning, and restricted tasks, such as face detection, have been solved with sufficient accuracy for restricted settings. However, the performance of machines in providing rich semantic descriptions of natural scenes from digital images remains highly limited and hugely inferior to that of humans. Here we quantify this \u201csemantic gap\u201d in a particular setting: We compare the efficiency of human and machine learning in assigning an image to one of two categories determined by the spatial arrangement of constituent parts. The images are not real, but the category-defining rules reflect the compositional structure of real images and the type of \u201creasoning\u201d that appears to be necessary for semantic parsing. Experiments demonstrate that human subjects grasp the separating principles from a handful of examples, whereas the error rates of computer programs fluctuate wildly and remain far behind that of humans even after exposure to thousands of examples. These observations lend support to current trends in computer vision such as integrating machine learning with parts-based modeling.<\/jats:p>","DOI":"10.1073\/pnas.1109168108","type":"journal-article","created":{"date-parts":[[2011,10,18]],"date-time":"2011-10-18T01:23:13Z","timestamp":1318900993000},"page":"17621-17625","update-policy":"https:\/\/doi.org\/10.1073\/pnas.cm10313","source":"Crossref","is-referenced-by-count":89,"title":["Comparing machines and humans on a visual categorization test"],"prefix":"10.1073","volume":"108","author":[{"given":"Fran\u00e7ois","family":"Fleuret","sequence":"first","affiliation":[{"name":"Idiap Research Institute, 1920 Martigny, Switzerland;"},{"name":"\u00c9cole Polytechnique F\u00e9d\u00e9rale de Lausanne, 1015 Lausanne, Switzerland;"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Ting","family":"Li","sequence":"additional","affiliation":[{"name":"Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD 21218; and"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Charles","family":"Dubout","sequence":"additional","affiliation":[{"name":"Idiap Research Institute, 1920 Martigny, Switzerland;"},{"name":"\u00c9cole Polytechnique F\u00e9d\u00e9rale de Lausanne, 1015 Lausanne, Switzerland;"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Emma K.","family":"Wampler","sequence":"additional","affiliation":[{"name":"Department of Psychological and Brain Sciences, Johns Hopkins University, Baltimore, MD 21218"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Steven","family":"Yantis","sequence":"additional","affiliation":[{"name":"Department of Psychological and Brain Sciences, Johns Hopkins University, Baltimore, MD 21218"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Donald","family":"Geman","sequence":"additional","affiliation":[{"name":"Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD 21218; and"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"341","published-online":{"date-parts":[[2011,10,17]]},"reference":[{"key":"e_1_3_3_1_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2009.22"},{"key":"e_1_3_3_2_2","first-page":"259","volume-title":"Foundations and Trends in Computer Graphics and Vision","author":"Zhu S","year":"2006","unstructured":"S Zhu, D Mumford, A stochastic grammar of images. Foundations and Trends in Computer Graphics and Vision (Now Publishers) 2, 259\u2013362 (2006)."},{"key":"e_1_3_3_3_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-006-0033-9"},{"key":"e_1_3_3_4_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-010-0391-1"},{"key":"e_1_3_3_5_2","first-page":"771","article-title":"A short introduction to Boosting","volume":"14","author":"Freund Y","year":"1999","unstructured":"Y Freund, RE Schapire, A short introduction to Boosting. Journal of Japanese Society for Artificial Intelligence 14, 771\u2013780 (1999).","journal-title":"Journal of Japanese Society for Artificial Intelligence"},{"key":"e_1_3_3_6_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4757-2440-0"},{"key":"e_1_3_3_7_2","volume-title":"Pattern Classification and Scene Analysis","author":"Duda R","year":"1973","unstructured":"R Duda, P Hart Pattern Classification and Scene Analysis (John Wiley & Sons, New York, 1973)."},{"key":"e_1_3_3_8_2","volume-title":"Abstract Inference","author":"Grenander U","year":"1980","unstructured":"U Grenander Abstract Inference (John Wiley & Sons, New York, 1980)."},{"key":"e_1_3_3_9_2","doi-asserted-by":"publisher","DOI":"10.1007\/BF00410640"},{"key":"e_1_3_3_10_2","volume-title":"Gesetze des Sehens [Laws of seeing]","author":"Metzger W","year":"1936","unstructured":"W Metzger Gesetze des Sehens [Laws of seeing] (MIT Press, Cambridge, MA, Translated and reprinted in 2006. (1936)."},{"key":"e_1_3_3_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/T-C.1973.223602"},{"key":"e_1_3_3_12_2","doi-asserted-by":"publisher","DOI":"10.1038\/14819"},{"key":"e_1_3_3_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/34.655647"},{"key":"e_1_3_3_14_2","doi-asserted-by":"publisher","DOI":"10.1023\/B:VISI.0000013087.49260.fb"},{"key":"e_1_3_3_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2005.177"},{"key":"e_1_3_3_16_2","first-page":"88","volume-title":"Proceedings of the 12th IAPR International Conference on Computer Vision and Image Processing","author":"LeCun Y","year":"1994","unstructured":"Y LeCun, Y Bengio, Word-level training of a handwritten word recognizer based on convolutional neural networks. Proceedings of the 12th IAPR International Conference on Computer Vision and Image Processing Vol. II, 88\u201392 (1994)."},{"key":"e_1_3_3_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2005.239"},{"key":"e_1_3_3_18_2","doi-asserted-by":"crossref","first-page":"1134","DOI":"10.1109\/ICCV.2003.1238476","volume-title":"Proceedings of the Ninth IEEE International Conference on Computer Vision","volume":"2","author":"Li F","year":"2003","unstructured":"F Li, R Fergus, P Perona, A Bayesian approach to unsupervised one-shot learning of object categories. Proceedings of the Ninth IEEE International Conference on Computer Vision 2, 1134 (2003)."},{"key":"e_1_3_3_19_2","doi-asserted-by":"publisher","DOI":"10.1023\/B:VISI.0000042934.15159.49"},{"key":"e_1_3_3_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2009.167"},{"key":"e_1_3_3_21_2","doi-asserted-by":"publisher","DOI":"10.1162\/neco.2006.18.7.1527"}],"container-title":["Proceedings of the National Academy of Sciences"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/pnas.org\/doi\/pdf\/10.1073\/pnas.1109168108","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,4,12]],"date-time":"2022-04-12T20:53:21Z","timestamp":1649796801000},"score":1,"resource":{"primary":{"URL":"https:\/\/pnas.org\/doi\/full\/10.1073\/pnas.1109168108"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2011,10,17]]},"references-count":21,"journal-issue":{"issue":"43","published-print":{"date-parts":[[2011,10,25]]}},"alternative-id":["10.1073\/pnas.1109168108"],"URL":"https:\/\/doi.org\/10.1073\/pnas.1109168108","relation":{},"ISSN":["0027-8424","1091-6490"],"issn-type":[{"value":"0027-8424","type":"print"},{"value":"1091-6490","type":"electronic"}],"subject":[],"published":{"date-parts":[[2011,10,17]]},"assertion":[{"value":"2011-10-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}