{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,28]],"date-time":"2025-10-28T15:13:24Z","timestamp":1761664404895},"reference-count":60,"publisher":"MIT Press - Journals","issue":"5","content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,4,15]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Visual understanding requires comprehending complex visual relations between objects within a scene. Here, we seek to characterize the computational demands for abstract visual reasoning. We do this by systematically assessing the ability of modern deep convolutional neural networks (CNNs) to learn to solve the synthetic visual reasoning test (SVRT) challenge, a collection of 23 visual reasoning problems. Our analysis reveals a novel taxonomy of visual reasoning tasks, which can be primarily explained by both the type of relations (same-different versus spatial-relation judgments) and the number of relations used to compose the underlying rules. Prior cognitive neuroscience work suggests that attention plays a key role in humans' visual reasoning ability. To test this hypothesis, we extended the CNNs with spatial and feature-based attention mechanisms. In a second series of experiments, we evaluated the ability of these attention networks to learn to solve the SVRT challenge and found the resulting architectures to be much more efficient at solving the hardest of these visual reasoning tasks. Most important, the corresponding improvements on individual tasks partially explained our novel taxonomy. Overall, this work provides a granular computational account of visual reasoning and yields testable neuroscience predictions regarding the differential need for feature-based versus spatial attention depending on the type of visual reasoning problem.<\/jats:p>","DOI":"10.1162\/neco_a_01485","type":"journal-article","created":{"date-parts":[[2022,3,2]],"date-time":"2022-03-02T00:51:17Z","timestamp":1646182277000},"page":"1075-1099","update-policy":"http:\/\/dx.doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":7,"title":["Understanding the Computational Demands Underlying Visual Reasoning"],"prefix":"10.1162","volume":"34","author":[{"given":"Mohit","family":"Vaishnav","sequence":"first","affiliation":[{"name":"Artificial and Natural Intelligence Toulouse Institute, Universit\u00e9 de Toulouse, 31052 Toulouse, France"},{"name":"Carney Institute for Brain Science, Department of Cognitive Linguistic and Psychological Sciences, Brown University, Providence, RI 02912, U.S.A. mohit.vaishnav@univ-toulouse.fr"}]},{"given":"Remi","family":"Cadene","sequence":"additional","affiliation":[{"name":"Carney Institute for Brain Science, Department of Cognitive Linguistic and Psychological Sciences, Brown University, Providence, RI 02912, U.S.A. remi.cadene@icloud.com"}]},{"given":"Andrea","family":"Alamia","sequence":"additional","affiliation":[{"name":"Centre de Recherche Cerveau et Cognition, CNRS, Universit\u00e9 de Toulouse, 31052 Toulouse, France artipago@gmail.com"}]},{"given":"Drew","family":"Linsley","sequence":"additional","affiliation":[{"name":"Carney Institute for Brain Science, Department of Cognitive Linguistic and Psychological Sciences, Brown University, Providence, RI 02912, U.S.A. drew_linsley@brown.edu"}]},{"given":"Rufin","family":"VanRullen","sequence":"additional","affiliation":[{"name":"Artificial and Natural Intelligence, Toulouse Institute, Universit\u00e9 de Toulouse, and Centre de Recherche Cerveau et Cognition, CNRS, Universit\u00e9 de Toulouse, 31052 Toulouse, France rufin.vanrullen@cnrs.fr"}]},{"given":"Thomas","family":"Serre","sequence":"additional","affiliation":[{"name":"Artificial and Natural Intelligence Toulouse Institute, Universit\u00e9 de Toulouse, 31052 Toulouse, France"},{"name":"Carney Institute for Brain Science, Department of Cognitive Linguistic and Psychological Sciences, Brown University, Providence, RI 02912, U.S.A. thomas_serre@brown.edu"}]}],"member":"281","published-online":{"date-parts":[[2022,4,15]]},"reference":[{"issue":"1","key":"2022042221321336200_B1","doi-asserted-by":"publisher","DOI":"10.1523\/ENEURO.0267-20.2020","article-title":"Differential involvement of EEG oscillatory components in sameness versus spatial-relation visual reasoning tasks.","volume":"8","author":"Alamia","year":"2021","journal-title":"eNeuro"},{"issue":"15","key":"2022042221321336200_B2","doi-asserted-by":"crossref","DOI":"10.1167\/15.15.6","article-title":"Contextual effects in visual working memory reveal hierarchically structured memory representations","volume":"15","author":"Brady","year":"2015","journal-title":"Journal of Vision"},{"key":"2022042221321336200_B3","author":"Carion","year":"2020","journal-title":"End-to-end object detection with transformers"},{"key":"2022042221321336200_B4","author":"Chen","year":"2015","journal-title":"ABC-CNN: An attention based convolutional neural network for visual question answering"},{"key":"2022042221321336200_B5","first-page":"5659","article-title":"Sca-CNN: Spatial and channel-wise attention in convolutional networks for image captioning.","author":"Chen","year":"2017","journal-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition"},{"issue":"7","key":"2022042221321336200_B6","doi-asserted-by":"publisher","first-page":"1933","DOI":"10.3758\/s13414-013-0601-3","article-title":"Working memory for relations among objects","volume":"76","author":"Clevenger","year":"2014","journal-title":"Attention, Perception, and Psychophysics"},{"key":"2022042221321336200_B7","doi-asserted-by":"crossref","first-page":"248","DOI":"10.1109\/CVPR.2009.5206848","article-title":"ImageNet: A large-scale hierarchical image database.","author":"Deng","year":"2009","journal-title":"Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition"},{"issue":"1","key":"2022042221321336200_B8","doi-asserted-by":"publisher","first-page":"193","DOI":"10.1146\/annurev.ne.18.030195.001205","article-title":"Neural mechanisms of selective visual attention","volume":"18","author":"Desimone","year":"1995","journal-title":"Annual Review of Neuroscience"},{"key":"2022042221321336200_B9","article-title":"Attention over learned object embeddings enables complex visual reasoning.","volume":"34","author":"Ding","year":"2021","journal-title":"Advances in neural information processing systems"},{"key":"2022042221321336200_B10","author":"Dosovitskiy","year":"2020","journal-title":"An image is worth 16 \u00d7 16 words: Transformers for image recognition at scale"},{"issue":"6","key":"2022042221321336200_B11","doi-asserted-by":"publisher","first-page":"380","DOI":"10.1111\/j.1467-9280.1994.tb00289.x","article-title":"Covert orienting in the split brain reveals hemispheric specialization for object-based attention","volume":"5","author":"Egly","year":"1994","journal-title":"Psychological Science"},{"key":"2022042221321336200_B12","article-title":"Unsupervised learning by program synthesis.","volume":"28","author":"Ellis","year":"2015","journal-title":"Advances in neural information processing systems"},{"issue":"1","key":"2022042221321336200_B13","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1167\/7.1.10","article-title":"What do we perceive in a glance of a real-world scene?","volume":"7","author":"Fei-Fei","year":"2007","journal-title":"J. Vis."},{"issue":"43","key":"2022042221321336200_B14","doi-asserted-by":"publisher","first-page":"26562","DOI":"10.1073\/pnas.1905334117","article-title":"Performance vs. competence in human\u2013machine comparisons","volume":"117","author":"Firestone","year":"2020","journal-title":"Proceedings of the National Academy of Sciences"},{"issue":"43","key":"2022042221321336200_B15","doi-asserted-by":"publisher","first-page":"17621","DOI":"10.1073\/pnas.1109168108","article-title":"Comparing machines and humans on a visual categorization test","volume":"108","author":"Fleuret","year":"2011","journal-title":"Proceedings of the National Academy of Sciences"},{"key":"2022042221321336200_B16","doi-asserted-by":"publisher","first-page":"63","DOI":"10.1016\/j.cobeha.2020.09.008","article-title":"Same\/different in visual reasoning","volume":"37","author":"Forbus","year":"2021","journal-title":"Current Opinion in Behavioral Sciences"},{"issue":"3","key":"2022042221321336200_B17","doi-asserted-by":"publisher","DOI":"10.1167\/jov.21.3.16","article-title":"Five points to check when comparing visual perception in humans and machines","volume":"21","author":"Funke","year":"2021","journal-title":"Journal of Vision"},{"issue":"11","key":"2022042221321336200_B18","doi-asserted-by":"publisher","first-page":"665","DOI":"10.1038\/s42256-020-00257-z","article-title":"Shortcut learning in deep neural networks","volume":"2","author":"Geirhos","year":"2020","journal-title":"Nature Machine Intelligence"},{"issue":"12","key":"2022042221321336200_B19","doi-asserted-by":"publisher","first-page":"3618","DOI":"10.1073\/pnas.1422953112","article-title":"Visual Turing test for computer vision systems","volume":"112","author":"Geman","year":"2015","journal-title":"Proc. Natl. Acad. Sci. U.S.A."},{"key":"2022042221321336200_B20","doi-asserted-by":"publisher","first-page":"84","DOI":"10.1016\/j.cobeha.2020.11.013","article-title":"Learning same and different relations: Cross-species comparisons","volume":"37","author":"Gentner","year":"2021","journal-title":"Current Opinion in Behavioral Sciences"},{"issue":"3","key":"2022042221321336200_B21","doi-asserted-by":"publisher","first-page":"2890","DOI":"10.1016\/j.neuroimage.2009.09.009","article-title":"Differential role of anterior prefrontal and premotor cortex in the processing of relational information","volume":"49","author":"Golde","year":"2010","journal-title":"NeuroImage"},{"key":"2022042221321336200_B22","author":"Greff","year":"2020","journal-title":"On the binding problem in artificial neural networks"},{"key":"2022042221321336200_B23","article-title":"Deep residual learning for image recognition.","author":"He","year":"2016","journal-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition"},{"issue":"13","key":"2022042221321336200_B24","doi-asserted-by":"publisher","first-page":"1135","DOI":"10.1016\/j.cub.2011.05.031","article-title":"Perceiving spatial relations via attentional tracking and shifting","volume":"21","author":"Holcombe","year":"2011","journal-title":"Curr. Biol."},{"key":"2022042221321336200_B25","first-page":"7132","article-title":"Squeeze-and-excitation networks.","author":"Hu","year":"2018","journal-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition"},{"issue":"4","key":"2022042221321336200_B26","doi-asserted-by":"publisher","DOI":"10.1098\/rsfs.2018.0011","article-title":"Not-so-CLEVR: learning same\u2013different relations strains feedforward neural networks","volume":"8","author":"Kim","year":"2018","journal-title":"Interface Focus"},{"key":"2022042221321336200_B27","author":"Kingma","year":"2014","journal-title":"Adam: A method for stochastic optimization"},{"issue":"1","key":"2022042221321336200_B28","doi-asserted-by":"publisher","first-page":"222","DOI":"10.1111\/nyas.14320","article-title":"Beyond the feedforward sweep: Feedback computations in the visual cortex","volume":"1464","author":"Kreiman","year":"2020","journal-title":"Ann. N.Y. Acad. Sci."},{"issue":"5","key":"2022042221321336200_B29","doi-asserted-by":"publisher","first-page":"477","DOI":"10.1093\/cercor\/12.5.477","article-title":"Recruitment of anterior dorsolateral prefrontal cortex in human reasoning: A parametric study of relational complexity","volume":"12","author":"Kroger","year":"2002","journal-title":"Cerebral Cortex"},{"key":"2022042221321336200_B30","author":"Lin","year":"2018","journal-title":"ResNet with one-neuron hidden layers is a universal approximator"},{"key":"2022042221321336200_B31","author":"Linsley","year":"2020","journal-title":"Recurrent neural circuits for contour detection"},{"key":"2022042221321336200_B32","author":"Linsley","year":"2018","journal-title":"Global-and-local attention networks for visual recognition"},{"key":"2022042221321336200_B33","author":"Linsley","year":"2018","journal-title":"Learning what and where to attend"},{"key":"2022042221321336200_B34","author":"Logan","year":"1994","journal-title":"On the ability to inhibit thought and action: A users' guide to the stop signal paradigm"},{"issue":"5","key":"2022042221321336200_B35","doi-asserted-by":"publisher","DOI":"10.1037\/0096-1523.20.5.1015","article-title":"Spatial attention and the apprehension of spatial relations","volume":"20","author":"Logan","year":"1994","journal-title":"Journal of Experimental Psychology: Human Perception and Performance"},{"key":"2022042221321336200_B36","doi-asserted-by":"crossref","DOI":"10.7551\/mitpress\/1187.001.0001","author":"Marcus","year":"2001","journal-title":"The algebraic mind: Integrating connectionism and cognitive science"},{"key":"2022042221321336200_B37","author":"Messina","year":"2021","journal-title":"Recurrent vision transformer for solving visual reasoning problems"},{"key":"2022042221321336200_B38","doi-asserted-by":"publisher","first-page":"75","DOI":"10.1016\/j.patrec.2020.12.019","article-title":"Solving the same-different task with convolutional neural networks","volume":"143","author":"Messina","year":"2021","journal-title":"Pattern Recognition Letters"},{"issue":"5","key":"2022042221321336200_B39","doi-asserted-by":"publisher","first-page":"1015","DOI":"10.1037\/0096-1523.20.5.1015","article-title":"Visual attention and the apprehension of spatial relations: The case of depth","volume":"20","author":"Moore","year":"1994","journal-title":"J. Exp. Psychol. Hum. Percept. Perform."},{"key":"2022042221321336200_B40","author":"Puebla","year":"2021","journal-title":"Can deep convolutional neural networks learn same-different relations"},{"key":"2022042221321336200_B41","author":"Ren","year":"2016","journal-title":"End-to-end instance segmentation and counting with recurrent attention"},{"key":"2022042221321336200_B42","doi-asserted-by":"publisher","first-page":"47","DOI":"10.1016\/j.cobeha.2020.08.008","article-title":"Same-different conceptualization: A machine vision perspective","volume":"37","author":"Ricci","year":"2021","journal-title":"Current Opinion in Behavioral Sciences"},{"issue":"6700","key":"2022042221321336200_B43","doi-asserted-by":"publisher","first-page":"376","DOI":"10.1038\/26475","article-title":"Object-based attention in the primary visual cortex of the macaque monkey","volume":"395","author":"Roelfsema","year":"1998","journal-title":"Nature"},{"issue":"2","key":"2022042221321336200_B44","doi-asserted-by":"publisher","first-page":"319","DOI":"10.3758\/BF03196288","article-title":"Attentional coding of categorical relations in scene perception: Evidence from the flicker paradigm","volume":"9","author":"Rosielle","year":"2002","journal-title":"Psychon. Bull. Rev."},{"key":"2022042221321336200_B45","author":"Sharma","year":"2015","journal-title":"Action recognition using visual attention"},{"issue":"3972","key":"2022042221321336200_B46","doi-asserted-by":"publisher","first-page":"701","DOI":"10.1126\/science.171.3972.701","article-title":"Mental rotation of three-dimensional objects","volume":"171","author":"Shepard","year":"1971","journal-title":"Science"},{"issue":"11","key":"2022042221321336200_B47","doi-asserted-by":"publisher","first-page":"8","DOI":"10.1167\/jov.21.11.8","article-title":"Evaluating the progress of deep learning for visual relational concepts","volume":"21","author":"Stabinger","year":"2021","journal-title":"Journal of Vision"},{"key":"2022042221321336200_B48","doi-asserted-by":"crossref","first-page":"380","DOI":"10.1007\/978-3-319-44781-0_45","article-title":"25 years of CNNs: Can we compare to human abstraction capabilities?","author":"Stabinger","year":"2016","journal-title":"Artificial Neural Networks and Machine Learning\u2013ICANN 2016"},{"key":"2022042221321336200_B49","first-page":"3545","article-title":"Deep networks with internal selective attention through feedback connections.","volume":"27","author":"Stollenga","year":"2014","journal-title":"Advances in neural information processing systems"},{"key":"2022042221321336200_B50","author":"Tolstikhin","year":"2021","journal-title":"MLP-mixer: An all-MLP architecture for vision"},{"key":"2022042221321336200_B51","author":"Touvron","year":"2021","journal-title":"Training data-efficient image transformers and distillation through attention"},{"key":"2022042221321336200_B52","first-page":"150","article-title":"Different binding strategies for the different stages of visual recognition.","author":"Tsotsos","year":"2007","journal-title":"Advances in brain, vision, and artificial intelligence"},{"issue":"6","key":"2022042221321336200_B53","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1371\/journal.pone.0038644","article-title":"Retinotopic mapping of categorical and coordinate spatial relation processing in early visual cortex","volume":"7","author":"Van Der Ham","year":"2012","journal-title":"PLOS One"},{"key":"2022042221321336200_B54","article-title":"Attention is all you need.","author":"Vaswani","year":"2017","journal-title":"Advances in neural information processing systems, 30"},{"issue":"9","key":"2022042221321336200_B55","doi-asserted-by":"publisher","first-page":"251","DOI":"10.1162\/neco_a_01413","article-title":"Do neural networks for segmentation understand insideness?","volume":"33","author":"Villalobos","year":"2021","journal-title":"Neural Computation"},{"key":"2022042221321336200_B56","first-page":"3","article-title":"CBAM: Convolutional block attention module","author":"Woo","year":"2018","journal-title":"Proceedings of the European Conference on Computer Vision"},{"key":"2022042221321336200_B57","author":"Xu","year":"2015","journal-title":"Ask, attend and answer: Exploring question-guided spatial attention for visual question answering"},{"key":"2022042221321336200_B58","doi-asserted-by":"crossref","first-page":"21","DOI":"10.1109\/CVPR.2016.10","article-title":"Stacked attention networks for image question answering.","author":"Yang","year":"2016","journal-title":"Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition"},{"key":"2022042221321336200_B59","author":"Yihe","year":"2019","journal-title":"Program synthesis performance constrained by non-linear spatial relations in synthetic visual reasoning test"},{"key":"2022042221321336200_B60","author":"Zhu","year":"2020","journal-title":"Deformable DETR: Deformable transformers for end-to-end object detection"}],"container-title":["Neural Computation"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/neco\/article-pdf\/34\/5\/1075\/2008682\/neco_a_01485.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/neco\/article-pdf\/34\/5\/1075\/2008682\/neco_a_01485.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,4,22]],"date-time":"2022-04-22T21:35:00Z","timestamp":1650663300000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/neco\/article\/34\/5\/1075\/109662\/Understanding-the-Computational-Demands-Underlying"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,4,15]]},"references-count":60,"journal-issue":{"issue":"5","published-online":{"date-parts":[[2022,4,15]]},"published-print":{"date-parts":[[2022,4,15]]}},"URL":"https:\/\/doi.org\/10.1162\/neco_a_01485","relation":{},"ISSN":["0899-7667","1530-888X"],"issn-type":[{"value":"0899-7667","type":"print"},{"value":"1530-888X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2022,5]]},"published":{"date-parts":[[2022,4,15]]}}}