{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,29]],"date-time":"2026-05-29T11:58:52Z","timestamp":1780055932985,"version":"3.54.0"},"reference-count":47,"publisher":"MDPI AG","issue":"17","license":[{"start":{"date-parts":[[2021,8,24]],"date-time":"2021-08-24T00:00:00Z","timestamp":1629763200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100004837","name":"Ministerio de Ciencia e Innovaci\u00f3n","doi-asserted-by":"publisher","award":["DPI2017-90035-R"],"award-info":[{"award-number":["DPI2017-90035-R"]}],"id":[{"id":"10.13039\/501100004837","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100012818","name":"Comunidad de Madrid","doi-asserted-by":"publisher","award":["S2018\/EMT-4362 SEGVAUTO 4.0-CM"],"award-info":[{"award-number":["S2018\/EMT-4362 SEGVAUTO 4.0-CM"]}],"id":[{"id":"10.13039\/100012818","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100006302","name":"Universidad de Alcal\u00e1","doi-asserted-by":"publisher","award":["30400M000.541.A 640.06"],"award-info":[{"award-number":["30400M000.541.A 640.06"]}],"id":[{"id":"10.13039\/501100006302","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Anticipating pedestrian crossing behavior in urban scenarios is a challenging task for autonomous vehicles. Early this year, a benchmark comprising JAAD and PIE datasets have been released. In the benchmark, several state-of-the-art methods have been ranked. However, most of the ranked temporal models rely on recurrent architectures. In our case, we propose, as far as we are concerned, the first self-attention alternative, based on transformer architecture, which has had enormous success in natural language processing (NLP) and recently in computer vision. Our architecture is composed of various branches which fuse video and kinematic data. The video branch is based on two possible architectures: RubiksNet and TimeSformer. The kinematic branch is based on different configurations of transformer encoder. Several experiments have been performed mainly focusing on pre-processing input data, highlighting problems with two kinematic data sources: pose keypoints and ego-vehicle speed. Our proposed model results are comparable to PCPA, the best performing model in the benchmark reaching an F1 Score of nearly 0.78 against 0.77. Furthermore, by using only bounding box coordinates and image data, our model surpasses PCPA by a larger margin (F1=0.75 vs. F1=0.72). Our model has proven to be a valid alternative to recurrent architectures, providing advantages such as parallelization and whole sequence processing, learning relationships between samples not possible with recurrent architectures.<\/jats:p>","DOI":"10.3390\/s21175694","type":"journal-article","created":{"date-parts":[[2021,8,24]],"date-time":"2021-08-24T22:09:39Z","timestamp":1629842979000},"page":"5694","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":53,"title":["CAPformer: Pedestrian Crossing Action Prediction Using Transformer"],"prefix":"10.3390","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6350-2460","authenticated-orcid":false,"given":"Javier","family":"Lorenzo","sequence":"first","affiliation":[{"name":"INVETT Research Group, Universidad de Alcal\u00e1, Campus Universitario, Ctra, Madrid-Barcelona km, 33, 600, 28805 Alcal\u00e1 de Henares, Spain"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3889-018X","authenticated-orcid":false,"given":"Ignacio Parra","family":"Alonso","sequence":"additional","affiliation":[{"name":"INVETT Research Group, Universidad de Alcal\u00e1, Campus Universitario, Ctra, Madrid-Barcelona km, 33, 600, 28805 Alcal\u00e1 de Henares, Spain"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6722-3036","authenticated-orcid":false,"given":"Rub\u00e9n","family":"Izquierdo","sequence":"additional","affiliation":[{"name":"INVETT Research Group, Universidad de Alcal\u00e1, Campus Universitario, Ctra, Madrid-Barcelona km, 33, 600, 28805 Alcal\u00e1 de Henares, Spain"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6688-5081","authenticated-orcid":false,"given":"Augusto Luis","family":"Ballardini","sequence":"additional","affiliation":[{"name":"INVETT Research Group, Universidad de Alcal\u00e1, Campus Universitario, Ctra, Madrid-Barcelona km, 33, 600, 28805 Alcal\u00e1 de Henares, Spain"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3779-6474","authenticated-orcid":false,"given":"\u00c1lvaro Hern\u00e1ndez","family":"Saz","sequence":"additional","affiliation":[{"name":"INVETT Research Group, Universidad de Alcal\u00e1, Campus Universitario, Ctra, Madrid-Barcelona km, 33, 600, 28805 Alcal\u00e1 de Henares, Spain"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2433-7110","authenticated-orcid":false,"given":"David Fern\u00e1ndez","family":"Llorca","sequence":"additional","affiliation":[{"name":"INVETT Research Group, Universidad de Alcal\u00e1, Campus Universitario, Ctra, Madrid-Barcelona km, 33, 600, 28805 Alcal\u00e1 de Henares, Spain"},{"name":"Joint Research Center, European Commission, 41092 Seville, Spain"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8809-2103","authenticated-orcid":false,"given":"Miguel \u00c1ngel","family":"Sotelo","sequence":"additional","affiliation":[{"name":"INVETT Research Group, Universidad de Alcal\u00e1, Campus Universitario, Ctra, Madrid-Barcelona km, 33, 600, 28805 Alcal\u00e1 de Henares, Spain"},{"name":"INVETT Research Group, Colegio de San Ildefonso, Universidad de Alcal\u00e1, Plaza de San Diego s\/n, 28801 Alcal\u00e1 de Henares, Spain"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1968","published-online":{"date-parts":[[2021,8,24]]},"reference":[{"key":"ref_1","unstructured":"World Health Organization (2018). Global Status Report on Road Safety 2018, World Health Organization."},{"key":"ref_2","unstructured":"Adminait\u00e9-Fodor, D., and Jost, G. (2019). Safer Roads, Safer Cities: How to Improve Urban Road Safety in The EU, European Transport Safety Council. Technical Report."},{"key":"ref_3","unstructured":"(2020). European New Car Assessment Programme (Euro NCAP) Test Protocol-AEB VRU Systems, Euro NCAP. Technical Report."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"895","DOI":"10.1177\/0278364920917446","article-title":"Human motion trajectory prediction: A survey","volume":"39","author":"Rudenko","year":"2020","journal-title":"Int. J. Robot. Res."},{"key":"ref_5","unstructured":"Rasouli, A., Kotseruba, I., and Tsotsos, J.K. (2020). Pedestrian Action Anticipation Using Contextual Feature Fusion in Stacked RNNs. arXiv."},{"key":"ref_6","unstructured":"Zhu, Y., Li, X., Liu, C., Zolfaghari, M., Xiong, Y., Wu, C., Zhang, Z., Tighe, J., Manmatha, R., and Li, M. (2020). A Comprehensive Study of Deep Video Action Recognition. arXiv."},{"key":"ref_7","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. (2017). Attention Is All You Need. arXiv."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Kotseruba, I., Rasouli, A., and Tsotsos, J.K. (2021, January 5\u20139). Benchmark for Evaluating Pedestrian Action Prediction. Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV).","DOI":"10.1109\/WACV48630.2021.00130"},{"key":"ref_9","unstructured":"Rasouli, A., and Tsotsos, J.K. (2018). Joint Attention in Driver-Pedestrian Interaction: From Theory to Practice. arXiv."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Rasouli, A., Kotseruba, I., and Tsotsos, J.K. (2017, January 22\u201329). Are They Going to Cross? A Benchmark Dataset and Baseline for Pedestrian Crosswalk Behavior. Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), Venice, Italy.","DOI":"10.1109\/ICCVW.2017.33"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Fang, Z., and L\u00f3pez, A.M. (2018). Is the Pedestrian going to Cross? Answering by 2D Pose Estimation. arXiv.","DOI":"10.1109\/IVS.2018.8500413"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Gesnouin, J., Pechberti, S., Bresson, G., Stanciulescu, B., and Moutarde, F. (2020). Predicting Intentions of Pedestrians from 2D Skeletal Pose Sequences with a Representation-Focused Multi-Branch Deep Learning Network. Algorithms, 13.","DOI":"10.3390\/a13120331"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Cadena, P.R.G., Yang, M., Qian, Y., and Wang, C. (2019, January 27\u201330). Pedestrian Graph: Pedestrian Crossing Prediction Based on 2D Pose Estimation and Graph Convolutional Networks. Proceedings of the 2019 IEEE Intelligent Transportation Systems Conference, ITSC, Auckland, New Zealand.","DOI":"10.1109\/ITSC.2019.8917118"},{"key":"ref_14","unstructured":"Ait Bouhsain, S., and Alahi, A. (2020). Pedestrian Intention Prediction: A Multi-Task Perspective. Technical Report. arXiv."},{"key":"ref_15","unstructured":"Lorenzo, J., Parra, I., Wirth, F., Stiller, C., Llorca, D.F., and Sotelo, M.A. (November, January 19). RNN-based Pedestrian Crossing Prediction using Activity and Pose-related Features. Proceedings of the IEEE Intelligent Vehicles Symposium, Las Vegas, NV, USA."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Ghori, O., MacKowiak, R., Bautista, M., Beuter, N., Drumond, L., DIego, F., and Ommer, B.B. (2018, January 26\u201330). Learning to Forecast Pedestrian Intention from Pose Dynamics. Proceedings of the 2018 IEEE Intelligent Vehicles Symposium (IV), Changshu, China.","DOI":"10.1109\/IVS.2018.8500657"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Ranga, A., Giruzzi, F., Bhanushali, J., Wirbel, E., P\u00e9rez, P., Vu, T.H., and Perrotton, X. (2020). VRUNet: Multi-Task Learning Model for Intent Prediction of Vulnerable Road Users. IS T Int. Symp. Electron. Imaging Sci. Technol., 2020.","DOI":"10.2352\/ISSN.2470-1173.2020.16.AVM-109"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"149318","DOI":"10.1109\/ACCESS.2019.2944792","article-title":"Multi-Task Deep Learning for Pedestrian Detection, Action Recognition and Time to Cross Prediction","volume":"7","author":"Pop","year":"2019","journal-title":"IEEE Access"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Saleh, K., Hossny, M., and Nahavandi, S. (2019). Real-time Intent Prediction of Pedestrians for Autonomous Ground Vehicles via Spatio-Temporal DenseNet. arXiv.","DOI":"10.1109\/ICRA.2019.8793991"},{"key":"ref_20","unstructured":"Yang, B., Zhan, W., Wang, P., Chan, C., Cai, Y., and Wang, N. (2021). Crossing or Not? Context-Based Recognition of Pedestrian Crossing Intention in the Urban Environment. IEEE Trans. Intell. Transp. Syst., 1\u201312."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Piccoli, F., Balakrishnan, R., Perez, M.J., Sachdeo, M., Nunez, C., Tang, M., Andreasson, K., Bjurek, K., Raj, R.D., and Davidsson, E. (2020). FuSSI-Net: Fusion of Spatio-temporal Skeletons for Intention Prediction Network. arXiv.","DOI":"10.1109\/IEEECONF51394.2020.9443552"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Gujjar, P., and Vaughan, R. (2019, January 20\u201324). Classifying pedestrian actions in advance using predicted video of urban driving scenes. Proceedings of the IEEE International Conference on Robotics and Automation, Montreal, QC, Canada.","DOI":"10.1109\/ICRA.2019.8794278"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Chaabane, M., Trabelsi, A., Blanchard, N., and Beveridge, R. (2020, January 1\u20135). Looking ahead: Anticipating pedestrians crossing with future frames prediction. Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision, WACV, Snowmass, CO, USA.","DOI":"10.1109\/WACV45572.2020.9093426"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Malla, S., Dariush, B., and Choi, C. (2020, January 14\u201319). TITAN: Future Forecast using Action Priors. Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR).","DOI":"10.1109\/CVPR42600.2020.01120"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"3485","DOI":"10.1109\/LRA.2020.2976305","article-title":"Spatiotemporal Relationship Reasoning for Pedestrian Intent Prediction","volume":"5","author":"Liu","year":"2020","journal-title":"IEEE Robot. Autom. Lett."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Rasouli, A., Kotseruba, I., Kunic, T., and Tsotsos, J.K. (November, January 27). PIE: A Large-Scale Dataset and Models for Pedestrian Intention Estimation and Trajectory Prediction. Proceedings of the 2019 IEEE\/CVF International Conference on Computer Vision (ICCV), Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00636"},{"key":"ref_27","unstructured":"Rasouli, A., Yau, T., Lakner, P., Malekmohammadi, S., Rohani, M., and Luo, J. (2020). PePScenes: A Novel Dataset and Baseline for Pedestrian Action Prediction in 3D. arXiv."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., and Beijbom, O. (2019). nuScenes: A Multimodal Dataset for Autonomous Driving. arXiv.","DOI":"10.1109\/CVPR42600.2020.01164"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Yau, T., Malekmohammadi, S., Rasouli, A., Lakner, P., Rohani, M., and Luo, J. (2020). Graph-SIM: A Graph-based Spatiotemporal Interaction Modelling for Pedestrian Action Prediction. arXiv.","DOI":"10.1109\/ICRA48506.2021.9561107"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Yang, D., Zhang, H., Yurtsever, E., Redmill, K., and \u00d6zg\u00fcner, \u00dc. (2021). Predicting Pedestrian Crossing Intention with Feature Fusion and Spatio-Temporal Attention. arXiv.","DOI":"10.1109\/TIV.2022.3162719"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Fan, L., Buch, S., Wang, G., Cao, R., Zhu, Y., Niebles, J.C., and Fei-Fei, L. (2020, January 23\u201328). RubiksNet: Learnable 3D-Shift for Efficient Video Action Recognition. Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK.","DOI":"10.1007\/978-3-030-58529-7_30"},{"key":"ref_32","unstructured":"Bertasius, G., Wang, H., and Torresani, L. Is Space-Time Attention All You Need for Video Understanding? In Proceedings of the International Conference on Machine Learning (ICML), 18\u201324 July 2021."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Khan, S., Naseer, M., Hayat, M., Waqas Zamir, S., Shahbaz Khan, F., and Shah, M. (2021). Transformers in Vision: A Survey. arXiv.","DOI":"10.1145\/3505244"},{"key":"ref_34","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., and Hovy, E. (2016, January 12\u201317). Hierarchical attention networks for document classification. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT), San Diego, CA, USA.","DOI":"10.18653\/v1\/N16-1174"},{"key":"ref_36","unstructured":"Loshchilov, I., and Hutter, F. (2019). Decoupled Weight Decay Regularization. arXiv."},{"key":"ref_37","unstructured":"(2021, August 15). Falcon, WA, e.a. PyTorch Lightning. GitHub. Available online: https:\/\/github.com\/PyTorchLightning\/pytorch-lightning."},{"key":"ref_38","unstructured":"Wallach, H., Larochelle, H., Beygelzimer, A., d\u2019Alch\u00e9-Buc, F., Fox, E., and Garnett, R. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. Advances in Neural Information Processing Systems 32 (NeurIPS), Curran Associates, Inc."},{"key":"ref_39","unstructured":"Biewald, L. (2021, August 15). Experiment Tracking with Weights and Biases. Available online: wandb.com."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Goyal, R., Kahou, S.E., Michalski, V., Materzy\u0144ska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., and Mueller-Freitag, M. (2017). The \u201cSomething Something\u201d Video Database for Learning and Evaluating Visual Common Sense. arXiv.","DOI":"10.1109\/ICCV.2017.622"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Carreira, J., and Zisserman, A. (2018). Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. arXiv.","DOI":"10.1109\/CVPR.2017.502"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Miech, A., Zhukov, D., Alayrac, J.B., Tapaswi, M., Laptev, I., and Sivic, J. (November, January 27). HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. Proceedings of the 2019 IEEE\/CVF International Conference on Computer Vision (ICCV), Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00272"},{"key":"ref_43","unstructured":"Carreira, J., Noland, E., Banki-Horvath, A., Hillier, C., and Zisserman, A. (2018). A Short Note about Kinetics-600. arXiv."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Bhattacharyya, A., Fritz, M., and Schiele, B. (2018). Long-Term On-Board Prediction of People in Traffic Scenes under Uncertainty. arXiv.","DOI":"10.1109\/CVPR.2018.00441"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. (2015, January 7\u201313). Learning Spatiotemporal Features with 3D Convolutional Networks. Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile.","DOI":"10.1109\/ICCV.2015.510"},{"key":"ref_46","first-page":"2825","article-title":"Scikit-learn: Machine Learning in Python","volume":"12","author":"Pedregosa","year":"2011","journal-title":"J. Mach. Learn. Res."},{"key":"ref_47","unstructured":"Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., and Koltun, V. (2017, January 13\u201315). CARLA: An Open Urban Driving Simulator. Proceedings of the 1st Annual Conference on Robot Learning (CoRL), Mountain View, CA, USA."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/17\/5694\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T06:50:39Z","timestamp":1760165439000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/17\/5694"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,8,24]]},"references-count":47,"journal-issue":{"issue":"17","published-online":{"date-parts":[[2021,9]]}},"alternative-id":["s21175694"],"URL":"https:\/\/doi.org\/10.3390\/s21175694","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,8,24]]}}}