{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,24]],"date-time":"2026-02-24T03:43:39Z","timestamp":1771904619538,"version":"3.50.1"},"reference-count":73,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2023,1,30]],"date-time":"2023-01-30T00:00:00Z","timestamp":1675036800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Artif. Intell."],"abstract":"<jats:p>While affordance detection and Human-Object interaction (HOI) detection tasks are related, the theoretical foundation of affordances makes it clear that the two are distinct. In particular, researchers in affordances make distinctions between J. J. Gibson's traditional definition of an affordance, \u201cthe action possibilities\u201d of the object within the environment, and the definition of a<jats:italic>telic<\/jats:italic>affordance, or one defined by conventionalized purpose or use. We augment the HICO-DET dataset with annotations for Gibsonian and telic affordances and a subset of the dataset with annotations for the orientation of the humans and objects involved. We then train an adapted Human-Object Interaction (HOI) model and evaluate a pre-trained viewpoint estimation system on this augmented dataset. Our model, AffordanceUPT, is based on a two-stage adaptation of the Unary-Pairwise Transformer (UPT), which we modularize to make affordance detection independent of object detection. 
Our approach exhibits generalization to new objects and actions, can effectively make the Gibsonian\/telic distinction, and shows that this distinction is correlated with features in the data that are not captured by the HOI annotations of the HICO-DET dataset.<\/jats:p>","DOI":"10.3389\/frai.2023.1084740","type":"journal-article","created":{"date-parts":[[2023,1,30]],"date-time":"2023-01-30T08:14:01Z","timestamp":1675066441000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["Grounding human-object interaction to affordance behavior in multimodal datasets"],"prefix":"10.3389","volume":"6","author":[{"given":"Alexander","family":"Henlein","sequence":"first","affiliation":[]},{"given":"Anju","family":"Gopinath","sequence":"additional","affiliation":[]},{"given":"Nikhil","family":"Krishnaswamy","sequence":"additional","affiliation":[]},{"given":"Alexander","family":"Mehler","sequence":"additional","affiliation":[]},{"given":"James","family":"Pustejovsky","sequence":"additional","affiliation":[]}],"member":"1965","published-online":{"date-parts":[[2023,1,30]]},"reference":[{"key":"B1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2204.01691","article-title":"Do as I can and not as I say: grounding language in robotic affordances","author":"Ahn","year":"2022","journal-title":"arXiv preprint"},{"key":"B2","first-page":"2425","article-title":"Vqa: visual question answering,","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Antol","year":"2015"},{"key":"B3","first-page":"5449","article-title":"From human instructions to robot actions: formulation of goals, affordances and probabilistic planning,","volume-title":"2016 IEEE International Conference on Robotics and Automation","author":"Antunes","year":"2016"},{"key":"B4","first-page":"9453","article-title":"Objectnet: a large-scale bias-controlled dataset for pushing the limits of object recognition 
models","volume":"32","author":"Barbu","year":"2019","journal-title":"Adv. Neural Inform. Process. Syst"},{"key":"B5","doi-asserted-by":"publisher","first-page":"2425","DOI":"10.3233\/FAIA200374","article-title":"A formalmodel of affordances for flexible robotic task execution,","author":"Be\u00dfler","year":"2020"},{"key":"B6","doi-asserted-by":"crossref","DOI":"10.1109\/CVPR52688.2022.01547","article-title":"Behave: dataset and method for tracking human object interactions,","volume-title":"IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Bhatnagar","year":"2022"},{"key":"B7","author":"Biewald","year":"2020","journal-title":"Experiment Tracking With Weights and Biases"},{"key":"B8","article-title":"Grounding spoken words in unlabeled video,","volume-title":"CVPR Workshops, Vol. 2","author":"Boggust","year":"2019"},{"key":"B9","doi-asserted-by":"publisher","first-page":"64","DOI":"10.1016\/j.bandc.2012.04.007","article-title":"One hand, two objects: emergence of affordance in contexts","volume":"80","author":"Borghi","year":"2012","journal-title":"Brain Cogn"},{"key":"B10","first-page":"11621","article-title":"Nuscenes: a multimodal dataset for autonomous driving,","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Caesar","year":"2020"},{"key":"B11","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58452-8_13","article-title":"End-to-end object detection with transformers","author":"Carion","year":"2020","journal-title":"CoRR, abs"},{"key":"B12","doi-asserted-by":"crossref","DOI":"10.3115\/v1\/P15-1006","article-title":"Text to 3d scene generation with rich lexical grounding,","volume-title":"Association for Computational Linguistics and International Joint Conference on Natural Language Processing","author":"Chang","year":"2015"},{"key":"B13","doi-asserted-by":"crossref","first-page":"381","DOI":"10.1109\/WACV.2018.00048","article-title":"Learning to detect human object 
interactions,","volume-title":"2018 IEEE Winter Conference on Applications of Computer Vision (WACV)","author":"Chao","year":"2018"},{"key":"B14","first-page":"4259","article-title":"Mining semantic affordances of visual object categories,","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Chao","year":"2015"},{"key":"B15","first-page":"888","article-title":"Automatic acquisition of ranked qualia structures from the web,","volume-title":"Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics","author":"Cimiano","year":"2007"},{"key":"B16","first-page":"720","article-title":"Scaling egocentric vision: the epic-kitchens dataset,","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Damen","year":"2018"},{"key":"B17","first-page":"1878","article-title":"3dposelite: a compact 3d pose estimation using node embeddings,","volume-title":"Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision","author":"Dani","year":"2021"},{"key":"B18","unstructured":"DuttaA. GuptaA. ZissermannA. VGG Image Annotator (VIA). 
Version: 2.0.112016"},{"key":"B19","doi-asserted-by":"crossref","DOI":"10.1145\/3343031.3350535","article-title":"The VIA annotation software for images, audio and video,","volume-title":"Proceedings of the 27th ACM International Conference on Multimedia, MM '19","author":"Dutta","year":"2019"},{"key":"B20","doi-asserted-by":"crossref","DOI":"10.1007\/3-540-55966-3_10","volume-title":"Using Orientation Information for Qualitative Spatial Reasoning.","author":"Freksa","year":"1992"},{"key":"B21","doi-asserted-by":"crossref","first-page":"64","DOI":"10.18653\/v1\/P17-2011","article-title":"An analysis of action recognition datasets for language and vision tasks,","volume-title":"Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)","author":"Gella","year":"2017"},{"key":"B22","first-page":"67","article-title":"The theory of affordances","volume":"1","author":"Gibson","year":"1977","journal-title":"Hilldale"},{"key":"B23","first-page":"5842","article-title":"The \u201csomething something\u201d video database for learning and evaluating visual common sense,","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Goyal","year":"2017"},{"key":"B24","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1505.04474","article-title":"Visual semantic role labeling","author":"Gupta","year":"2015","journal-title":"arXiv preprint"},{"key":"B25","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3446370","article-title":"Visual affordance and function understanding: a survey","volume":"54","author":"Hassanin","year":"2021","journal-title":"ACM Comput. 
Surv"},{"key":"B26","doi-asserted-by":"crossref","first-page":"770","DOI":"10.1109\/CVPR.2016.90","article-title":"Deep residual learning for image recognition,","volume-title":"2016 IEEE Conference on Computer Vision and Pattern Recognition","author":"He","year":"2016"},{"key":"B27","first-page":"70","article-title":"Affordances for robots: a brief survey. AVANT","volume":"2","author":"Horton","year":"2012","journal-title":"Pismo Awangardy Filozoficzno Naukowej"},{"key":"B28","first-page":"495","article-title":"Affordance transfer learning for human-object interaction detection,","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Hou","year":""},{"key":"B29","doi-asserted-by":"crossref","DOI":"10.1109\/CVPR46437.2021.01441","article-title":"Detecting human-object interaction via fabricated compositional learning,","volume-title":"CVPR","author":"Hou","year":""},{"key":"B30","first-page":"470","article-title":"Multimodal pretraining for dense video captioning,","volume-title":"Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing","author":"Huang","year":"2020"},{"key":"B31","first-page":"74","article-title":"Hotr: endto-end human-object interaction detection with transformers,","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Kim","year":"2021"},{"key":"B32","doi-asserted-by":"crossref","first-page":"5578","DOI":"10.1109\/ICRA.2014.6907679","article-title":"Semantic labeling of 3d point clouds with object affordance for robot manipulation,","volume-title":"2014 IEEE International Conference on Robotics and Automation (ICRA)","author":"Kim","year":"2014"},{"key":"B33","doi-asserted-by":"publisher","first-page":"32","DOI":"10.1007\/s11263-016-0981-7","article-title":"Visual genome: connecting language and vision using 
crowdsourced dense image annotations","volume":"123","author":"Krishna","year":"2016","journal-title":"Int. J. Comput. Vision"},{"key":"B34","first-page":"10166","article-title":"Detailed 2d-3d joint representation for human-object interaction,","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Li","year":"2020"},{"key":"B35","first-page":"740","article-title":"Microsoft coco: common objects in context,","volume-title":"European Conference on Computer Vision","author":"Lin","year":"2014"},{"key":"B36","doi-asserted-by":"crossref","DOI":"10.1609\/aaai.v28i1.9051","article-title":"Learning from unscripted deictic gesture and language for human-robot interactions,","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Matuszek","year":"2014"},{"key":"B37","doi-asserted-by":"publisher","first-page":"25966","DOI":"10.1073\/pnas.1910416117","article-title":"Placing language in an integrated understanding system: next steps toward human-level performance in neural language models","volume":"117","author":"McClelland","year":"2020","journal-title":"Proc. Natl. Acad. Sci. U.S.A"},{"key":"B38","first-page":"23","article-title":"Action identification and local equivalence of action verbs: the annotation framework of the imagact ontology,","volume-title":"Proceedings of the LREC 2018 Workshop AREA. 
Annotation, Recognition and Evaluation of Actions","author":"Moneglia","year":"2018"},{"key":"B39","doi-asserted-by":"crossref","first-page":"1374","DOI":"10.1109\/ICRA.2015.7139369","article-title":"Affordance detection of tool parts from geometric features,","volume-title":"2015 IEEE International Conference on Robotics and Automation (ICRA)","author":"Myers","year":"2015"},{"key":"B40","doi-asserted-by":"publisher","first-page":"512","DOI":"10.1016\/j.neuroscience.2015.09.060","article-title":"The visual encoding of tool-object affordances","volume":"310","author":"Natraj","year":"2015","journal-title":"Neuroscience"},{"key":"B41","first-page":"1407","article-title":"In defense of scene graphs for image captioning,","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Nguyen","year":"2021"},{"key":"B42","doi-asserted-by":"crossref","DOI":"10.1109\/CVPR52688.2022.00665","article-title":"Templates for 3d object pose estimation revisited: generalization to new objects and robustness to occlusions,","volume-title":"Proceedings IEEE Conf. on Computer Vision and Pattern Recognition","author":"Nguyen","year":"2022"},{"key":"B43","doi-asserted-by":"publisher","first-page":"73","DOI":"10.1017\/S0140525X0200002X","article-title":"Two visual systems and two theories of perception: an attempt to reconcile the constructivist and ecological approaches","volume":"25","author":"Norman","year":"2002","journal-title":"Behav. Brain Sci"},{"key":"B44","doi-asserted-by":"publisher","first-page":"403","DOI":"10.1016\/j.neubiorev.2017.04.014","article-title":"What is an affordance? 40 years later","volume":"77","author":"Osiurak","year":"2017","journal-title":"Neurosci. Biobehav. 
Rev"},{"key":"B45","article-title":"A reinforcement learning approach for enacting cautious behaviours in autonomous driving system: safe speed choice in the interaction with distracted pedestrians,","volume-title":"IEEE Transactions on Intelligent Transportation Systems","author":"Papini","year":"2021"},{"key":"B46","doi-asserted-by":"crossref","DOI":"10.7551\/mitpress\/3225.001.0001","volume-title":"The Generative Lexicon","author":"Pustejovsky","year":"1995"},{"key":"B47","first-page":"1","article-title":"Dynamic event structure and habitat theory,","volume-title":"Proceedings of the 6th International Conference on Generative Approaches to the Lexicon","author":"Pustejovsky","year":"2013"},{"key":"B48","first-page":"4606","article-title":"VoxML: a visualization modeling language,","author":"Pustejovsky","year":"2016","journal-title":"Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)"},{"key":"B49","first-page":"8748","author":"Radford","year":"2021"},{"key":"B50","first-page":"70","article-title":"Disambiguation of basic action types through nouns' telic qualia,","volume-title":"Proceedings of the 6th International Conference on Generative Approaches to the Lexicon","author":"Russo","year":"2013"},{"key":"B51","doi-asserted-by":"publisher","DOI":"10.21437\/GLU.2017-17","article-title":"Interactive robot learning of gestures, language and affordances","author":"Saponaro","year":"2017","journal-title":"arXiv preprint"},{"key":"B52","doi-asserted-by":"crossref","first-page":"1568","DOI":"10.1109\/WACV.2018.00181","article-title":"Scaling human-object interaction recognition through zero-shot learning,","volume-title":"2018 IEEE Winter Conference on Applications of Computer Vision (WACV)","author":"Shen","year":"2018"},{"key":"B53","doi-asserted-by":"crossref","DOI":"10.1109\/CVPR46437.2021.01027","article-title":"QPIC: Query-based pairwise human-object interaction detection with image-wide contextual 
information,","volume-title":"CVPR","author":"Tamura","year":"2021"},{"key":"B54","first-page":"1691","article-title":"Language grounding with 3d objects,","volume-title":"Conference on Robot Learning","author":"Thomason","year":"2022"},{"key":"B55","doi-asserted-by":"publisher","first-page":"51","DOI":"10.1162\/001152604772746693","article-title":"Learning through others","volume":"133","author":"Tomasello","year":"2004","journal-title":"Daedalus"},{"key":"B56","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1809.10790","article-title":"Deep object pose estimation for semantic robotic grasping of household objects","author":"Tremblay","year":"2018","journal-title":"arXiv preprint"},{"key":"B57","first-page":"6000","article-title":"Attention is all you need,","volume-title":"Advances in Neural Information Processing Systems, Vol. 30","author":"Vaswani","year":"2017"},{"key":"B58","first-page":"11652","article-title":"Discovering1 human interactions with novel objects via zero-shot learning,","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Wang","year":"2020"},{"key":"B59","doi-asserted-by":"publisher","first-page":"1","DOI":"10.48550\/arXiv.2012.04456","article-title":"Understanding how dimension reduction tools work: an empirical approach to deciphering t-sne, umap, trimap, and pacmap for data visualization","volume":"22","author":"Wang","year":"2021","journal-title":"J. Mach. Learn. 
Res"},{"key":"B60","doi-asserted-by":"crossref","DOI":"10.1007\/978-3-319-46484-8_10","article-title":"Objectnet3d: A large scale database for 3d object recognition,","volume-title":"European Conference Computer Vision","author":"Xiang","year":"2016"},{"key":"B61","doi-asserted-by":"crossref","DOI":"10.1109\/3DV53792.2021.00018","article-title":"Posecontrast: Class-agnostic object viewpoint estimation in the wild with pose-aware contrastive learning,","volume-title":"International Conference on 3D Vision","author":"Xiao","year":"2021"},{"key":"B62","article-title":"Pose fromshape: Deep pose estimation for arbitrary 3D objects,","volume-title":"British Machine Vision Conference","author":"Xiao","year":"2019"},{"key":"B63","doi-asserted-by":"publisher","first-page":"30","DOI":"10.18653\/v1\/2020.nlpbt-1.4","article-title":"A benchmark for structured procedural knowledge extraction from cooking videos,","author":"Xu","year":"2020","journal-title":"Proceedings of the First International Workshop on Natural Language Processing Beyond Text"},{"key":"B64","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2108.08420","article-title":"D3d-hoi: dynamic 3d human-object interactions from videos","author":"Xu","year":"2021","journal-title":"arXiv preprint"},{"key":"B65","doi-asserted-by":"publisher","first-page":"1534","DOI":"10.1093\/ietisy\/e90-d.10.1534","article-title":"Automatic acquisition of qualia structure from corpus data","volume":"90","author":"Yamada","year":"2007","journal-title":"IEICE Trans. Inform. 
Syst"},{"key":"B66","doi-asserted-by":"crossref","first-page":"17","DOI":"10.1109\/CVPR.2010.5540235","article-title":"Modeling mutual context of object and human pose in human-object interaction activities,","volume-title":"2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition","author":"Yao","year":"2010"},{"key":"B67","doi-asserted-by":"publisher","first-page":"812","DOI":"10.1037\/a0017175","article-title":"The paired-object affordance effect","volume":"36","author":"Yoon","year":"2010","journal-title":"J. Exp. Psychol. Hum. Percept. Perform"},{"key":"B68","doi-asserted-by":"publisher","first-page":"134","DOI":"10.1016\/j.bandc.2006.04.002","article-title":"Are different affordances subserved by different neural pathways?","volume":"62","author":"Young","year":"2006","journal-title":"Brain Cogn"},{"key":"B69","article-title":"Mining the benefits of two-stage and one-stage hoi detection,","volume":"34","author":"Zhang","year":"2021","journal-title":"Advances in Neural Information Processing Systems"},{"key":"B70","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01947","article-title":"Efficient two-stage detection of human-object interactions with a novel unary-pairwise transformer","author":"Zhang","year":"","journal-title":"arXiv preprint"},{"key":"B71","first-page":"13319","article-title":"Spatially conditioned graphs for detecting human-object interactions,","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Zhang","year":""},{"key":"B72","first-page":"19548","article-title":"Exploring structure-aware transformer over interaction proposals for human-object interaction detection,","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Zhang","year":"2022"},{"key":"B73","first-page":"11825","article-title":"End-to end human object interaction detection with hoi transformer,","volume-title":"Proceedings of the IEEE\/CVF 
Conference on Computer Vision and Pattern Recognition","author":"Zou","year":"2021"}],"container-title":["Frontiers in Artificial Intelligence"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/frai.2023.1084740\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,10,13]],"date-time":"2024-10-13T06:59:54Z","timestamp":1728802794000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/frai.2023.1084740\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,1,30]]},"references-count":73,"alternative-id":["10.3389\/frai.2023.1084740"],"URL":"https:\/\/doi.org\/10.3389\/frai.2023.1084740","relation":{},"ISSN":["2624-8212"],"issn-type":[{"value":"2624-8212","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,1,30]]},"article-number":"1084740"}}