{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,27]],"date-time":"2026-03-27T09:11:44Z","timestamp":1774602704637,"version":"3.50.1"},"reference-count":44,"publisher":"Springer Science and Business Media LLC","issue":"3","license":[{"start":{"date-parts":[[2026,2,21]],"date-time":"2026-02-21T00:00:00Z","timestamp":1771632000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2026,2,21]],"date-time":"2026-02-21T00:00:00Z","timestamp":1771632000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Comput Vis"],"published-print":{"date-parts":[[2026,3]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>\n                    Unlike the sparse label action detection task, where a single action occurs in each timestamp of a video, in a dense multi-label scenario, actions can overlap temporally. To address this challenging task, it is necessary to simultaneously learn (i) co-occurrence action relationships and (ii) temporal dependencies. Current methods model co-occurrence action relationships by explicitly embedding class relations into the transformer network architecture. However, these approaches are not computationally efficient, as the network needs to compute all possible pair action class relations. In this paper, we overcome this by introducing a novel framework trained through a novel learning paradigm that allows the network to benefit from explicitly modelling temporal co-occurrence action dependencies during training without imposing their computational overhead during inference. Furthermore, to model temporal information, recent approaches extract multi-scale temporal features through hierarchical transformer-based networks. However, the self-attention mechanism in transformers inherently loses temporal positional information. We argue that combining this with multiple sub-sampling processes in hierarchical designs can lead to further loss of positional information. Preserving this information is essential for accurate action detection. In this paper, we address this issue by proposing a novel transformer network that (a) employs a non-hierarchical structure when modelling different ranges of temporal dependencies and (b) embeds relative positional encoding in its transformer layers. We evaluate the performance of our proposed approach on two challenging dense multi-label benchmark datasets and show that our method improves the current state-of-the-art results by 1.1% and 0.6% per-frame mAP on the Charades and MultiTHUMOS datasets, respectively, achieving new state-of-the-art per-frame mAP results at 26.5% and 44.6%, respectively. We also performed extensive ablation studies to examine the impact of the different components of our proposed approach.\n                    <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"https:\/\/github.com\/faeghehsardari\/E-E-IJCV\" ext-link-type=\"uri\">\n                      <jats:underline>Our code will be released upon paper publication<\/jats:underline>\n                    <\/jats:ext-link>\n                  <\/jats:p>","DOI":"10.1007\/s11263-026-02738-x","type":"journal-article","created":{"date-parts":[[2026,2,21]],"date-time":"2026-02-21T07:14:50Z","timestamp":1771658090000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["An Effective-Efficient Approach for Dense Multi-Label Action Detection"],"prefix":"10.1007","volume":"134","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9134-0427","authenticated-orcid":false,"given":"Faegheh","family":"Sardari","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Armin","family":"Mustafa","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Philip J. B.","family":"Jackson","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Adrian","family":"Hilton","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2026,2,21]]},"reference":[{"key":"2738_CR1","doi-asserted-by":"crossref","unstructured":"Badamdorj, T., Rochan, M., Wang, Y., & Cheng, L. (2022). Contrastive Learning for Unsupervised Video Highlight Detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 14042\u201314052).","DOI":"10.1109\/CVPR52688.2022.01365"},{"key":"2738_CR2","doi-asserted-by":"crossref","unstructured":"Caba Heilbron, F., Escorcia, V., Ghanem, B., & Carlos Niebles, J. (2015). ActivityNet: A large-Scale Video Benchmark for Human Activity Understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 961\u2013970).","DOI":"10.1109\/CVPR.2015.7298698"},{"key":"2738_CR3","doi-asserted-by":"crossref","unstructured":"Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-End Object Detection with Transformers. In: European conference on computer vision (pp. 213\u2013229). Springer.","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"2738_CR4","doi-asserted-by":"crossref","unstructured":"Carreira, J., & Zisserman, A. (2017). Quo vadis, Action Recognition? a New Model and the Kinetics Dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6299\u20136308).","DOI":"10.1109\/CVPR.2017.502"},{"key":"2738_CR5","doi-asserted-by":"crossref","unstructured":"Chao, Y. W., Vijayanarasimhan, S., Seybold, B., Ross, D. A., Deng, J., & Sukthankar, R. (2018). Rethinking the Faster R-CNN Architecture for Temporal Action Localization. In: proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1130\u20131139).","DOI":"10.1109\/CVPR.2018.00124"},{"key":"2738_CR6","doi-asserted-by":"crossref","unstructured":"Dai, R., Das, S., & Bremond, F. (2021). Ctrn: Class-temporal relational network for action detection. British Machine Vision Conference","DOI":"10.5244\/C.35.37"},{"key":"2738_CR7","doi-asserted-by":"crossref","unstructured":"Dai, R., Das, S., Minciullo, L., Garattoni, L., Francesca, G., & Bremond, F. (2021). PDAN: Pyramid Dilated Attention Network for Action Detection. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision (pp. 2970\u20132979).","DOI":"10.1109\/WACV48630.2021.00301"},{"key":"2738_CR8","doi-asserted-by":"crossref","unstructured":"Dai, R., Das, S., Kahatapitiya, K., Ryoo, M. S., & Br\u00e9mond, F. (2022). MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 20041\u201320051).","DOI":"10.1109\/CVPR52688.2022.01941"},{"key":"2738_CR9","doi-asserted-by":"crossref","unstructured":"Dai, X., Singh, B., Ng, J. Y. H., & Davis, L. (2019). TAN: Temporal Aggregation Network for Dense Multi-Label Action Recognition. In: 2019 IEEE Winter Conference on Applications of Computer Vision (pp. 151\u2013160). IEEE.","DOI":"10.1109\/WACV.2019.00022"},{"issue":"3","key":"2738_CR10","doi-asserted-by":"publisher","first-page":"733","DOI":"10.1162\/coli_a_00445","volume":"48","author":"P Dufter","year":"2022","unstructured":"Dufter, P., Schmitt, M., & Sch\u00fctze, H. (2022). Position Information in Transformers: An Overview. Computational Linguistics, 48(3), 733\u2013763.","journal-title":"Computational Linguistics"},{"key":"2738_CR11","doi-asserted-by":"crossref","unstructured":"Feichtenhofer, C. (2020). X3D: Expanding Architectures for Efficient Video Recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 203\u2013213).","DOI":"10.1109\/CVPR42600.2020.00028"},{"key":"2738_CR12","doi-asserted-by":"crossref","unstructured":"Foo, L. G., Li, T., Rahmani, H., & Liu, J. (2024). Action detection via an image diffusion process. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (pp. 18351\u201318361).","DOI":"10.1109\/CVPR52733.2024.01737"},{"key":"2738_CR13","first-page":"6840","volume":"33","author":"J Ho","year":"2020","unstructured":"Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840\u20136851.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2738_CR14","unstructured":"Huang, C. Z. A., Vaswani, A., Uszkoreit, J. Shazeer, N., Simon, I., Hawthorne, C., Dai, A. M Hoffman, M. D. Dinculescu, M. & Eck, D. (2019). Music Transformer. International Conference on Learning Representations."},{"key":"2738_CR15","unstructured":"Jiang, Y. G., Liu, J., Zamir, A. R., Toderici, G., Laptev, I., Shah, M., & Sukthankar, R. (2014). THUMOS Challenge: Action Recognition with a Large Number of Classes."},{"key":"2738_CR16","doi-asserted-by":"crossref","unstructured":"Kahatapitiya, K., & Ryoo, M. S. (2021). Coarse-Fine Networks for Temporal Activity Detection in Videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 8385\u20138394).","DOI":"10.1109\/CVPR46437.2021.00828"},{"key":"2738_CR17","unstructured":"Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., ... & Zisserman, A. (2017). The Kinetics Human Action Video Dataset. arXiv preprint arXiv:1705.06950"},{"key":"2738_CR18","doi-asserted-by":"crossref","unstructured":"Kim, J., Lee, M., & Heo, J. P. (2023). Self-feedback detr for temporal action detection. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision (pp. 10286\u201310296).","DOI":"10.1109\/ICCV51070.2023.00944"},{"key":"2738_CR19","first-page":"15816","volume":"34","author":"Y Li","year":"2021","unstructured":"Li, Y., Si, S., Li, G., Hsieh, C. J., & Bengio, S. (2021). Learnable Fourier Features for Multi-Dimensional Spatial Positional Encoding. Advances in Neural Information Processing Systems, 34, 15816\u201315829.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2738_CR20","doi-asserted-by":"crossref","unstructured":"Li, Z., & Yao, L. (2021). Three Birds with One Stone: Multi-Task Temporal Action Detection via Recycling Temporal Annotations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 4751\u20134760).","DOI":"10.1109\/CVPR46437.2021.00472"},{"key":"2738_CR21","doi-asserted-by":"crossref","unstructured":"Lin, T., Zhao, X., Su, H., Wang, C., & Yang, M. (2018). BSN: Boundary Sensitive Network for Temporal Action Proposal Generation. In: Proceedings of the European Conference on Computer Vision (pp. 3\u201319).","DOI":"10.1007\/978-3-030-01225-0_1"},{"key":"2738_CR22","first-page":"11449","volume":"34","author":"YB Lin","year":"2021","unstructured":"Lin, Y. B., Tseng, H. Y., Lee, H. Y., Lin, Y. Y., & Yang, M. H. (2021). Exploring cross-video and cross-modality signals for weakly-supervised audio-visual video parsing. Advances in Neural Information Processing Systems, 34, 11449\u201311461.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2738_CR23","doi-asserted-by":"crossref","unstructured":"Liu, Y., Ma, L., Zhang, Y., Liu, W., & Chang, S. F. (2019). Multi-Granularity Generator for Temporal Action Proposal. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3604\u20133613).","DOI":"10.1109\/CVPR.2019.00372"},{"key":"2738_CR24","doi-asserted-by":"crossref","unstructured":"Min, J., Buch, S., Nagrani, A., Cho, M., & Schmid, C. (2024). MoReVQA: Exploring Modular Reasoning Models for Video Question Answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 13235\u201313245).","DOI":"10.1109\/CVPR52733.2024.01257"},{"key":"2738_CR25","doi-asserted-by":"crossref","unstructured":"Nag, S., Zhu, X., Deng, J., Song, Y. Z., & Xiang, T. (2023). Difftad: Temporal action detection with proposal denoising diffusion. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision (pp. 10362\u201310374).","DOI":"10.1109\/ICCV51070.2023.00951"},{"key":"2738_CR26","doi-asserted-by":"crossref","unstructured":"Ntinou, I., Sanchez, E., & Tzimiropoulos, G. (2024). Multiscale vision transformers meet bipartite matching for efficient single-stage action localization. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (pp. 18827\u201318836).","DOI":"10.1109\/CVPR52733.2024.01781"},{"key":"2738_CR27","unstructured":"Piergiovanni, A., & Ryoo, M. (2019). Temporal Gaussian Mixture Layer for Videos. In: International Conference on Machine learning (pp. 5152\u20135161). PMLR."},{"key":"2738_CR28","doi-asserted-by":"crossref","unstructured":"Piergiovanni, A., & Ryoo, M. S. (2018). Learning Latent Super-Events to Detect Multiple Activities in Videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5304\u20135313).","DOI":"10.1109\/CVPR.2018.00556"},{"key":"2738_CR29","doi-asserted-by":"crossref","unstructured":"Ridnik, T., Ben-Baruch, E., Zamir, N., Noy, A., Friedman, I., Protter, M., & Zelnik-Manor, L. (2021). Asymmetric Loss For Multi-Label Classification. In: Proceedings of the IEEE International Conference on Computer Vision, IEEE Computer Society (pp. 82\u201391).","DOI":"10.1109\/ICCV48922.2021.00015"},{"key":"2738_CR30","doi-asserted-by":"crossref","unstructured":"Sardari, F., Mustafa, A., Jackson, P. J., & Hilton, A. (2023). Pat: Position-aware transformer for dense multi-label action detection. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV) Workshops (pp. 2988\u20132997).","DOI":"10.1109\/ICCVW60793.2023.00321"},{"key":"2738_CR31","doi-asserted-by":"crossref","unstructured":"Shaw, P., Uszkoreit, J., & Vaswani, A. (2018). Self-Attention with Relative Position Representations. Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2.","DOI":"10.18653\/v1\/N18-2074"},{"key":"2738_CR32","doi-asserted-by":"crossref","unstructured":"Shen, T., Zhou, T., Long, G., Jiang, J., Pan, S., & Zhang, C. (2018). DiSAN: Directional Self-Attention Network for RNN\/CNN-Free Language Understanding. In: Proceedings of the AAAI Conference on Artificial Intelligence.","DOI":"10.1609\/aaai.v32i1.11941"},{"key":"2738_CR33","doi-asserted-by":"crossref","unstructured":"Shi, D., Zhong, Y., Cao, Q., Zhang, J., Ma, L., Li, J., & Tao, D. (2022). React: Temporal action detection with relational queries. In: European conference on computer vision (pp. 105\u2013121). Springer.","DOI":"10.1007\/978-3-031-20080-9_7"},{"key":"2738_CR34","doi-asserted-by":"crossref","unstructured":"Sigurdsson, G. A., Varol, G., Wang, X., Farhadi, A., Laptev, I., & Gupta, A. (2016). Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding. In: Proceedings of the European Conference on Computer Vision (pp. 510\u2013526).","DOI":"10.1007\/978-3-319-46448-0_31"},{"key":"2738_CR35","doi-asserted-by":"crossref","unstructured":"Tan, J., Zhao, X., Shi, X., Kang, B., & Wang, L. (2022). PointTAD: Multi-Label Temporal Action Detection with Learnable Query Points. In: Advances in Neural Information Processing Systems,35, 15268\u201315280.","DOI":"10.52202\/068431-1111"},{"key":"2738_CR36","doi-asserted-by":"crossref","unstructured":"Tirupattur, P., Duarte, K., Rawat, Y. S., & Shah, M. (2021). Modeling Multi-Label Action Dependencies for Temporal Action Localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 1460\u20131470).","DOI":"10.1109\/CVPR46437.2021.00151"},{"issue":"4","key":"2738_CR37","doi-asserted-by":"publisher","first-page":"4302","DOI":"10.1109\/TPAMI.2022.3193611","volume":"45","author":"E Vahdani","year":"2022","unstructured":"Vahdani, E., & Tian, Y. (2022). Deep Learning-based Action Detection in Untrimmed Videos: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4), 4302\u20134320.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"2738_CR38","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30."},{"key":"2738_CR39","doi-asserted-by":"crossref","unstructured":"Xiao, Y., Luo, Z., Liu, Y., Ma, Y., Bian, H., Ji, Y., ... & Li, X. (2024). Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 18709\u201318719).","DOI":"10.1109\/CVPR52733.2024.01770"},{"key":"2738_CR40","doi-asserted-by":"crossref","unstructured":"Xu, H., Das, A., & Saenko, K. (2017). R-C3D: Region Convolutional 3D Network for Temporal Activity Detection. In: Proceedings of the IEEE International Conference on Computer Vision (pp. 5783\u20135792).","DOI":"10.1109\/ICCV.2017.617"},{"key":"2738_CR41","doi-asserted-by":"crossref","unstructured":"Yang, Z., Wang, R., Tan, Y., & Xie, L. (2024). Malt: Multi-scale action learning transformer for online action detection. In: 2024 International Joint Conference on Neural Networks (IJCNN) (pp. 1\u20138). IEEE.","DOI":"10.1109\/IJCNN60899.2024.10650511"},{"key":"2738_CR42","doi-asserted-by":"crossref","unstructured":"Yeung, S., Russakovsky, O., Jin, N., Andriluka, M., Mori, G., & Fei-Fei, L. (2018). Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. International Journal of Computer Vision, 126(2), 375\u2013389.","DOI":"10.1007\/s11263-017-1013-y"},{"key":"2738_CR43","doi-asserted-by":"crossref","unstructured":"Zhang, C. L., Wu, J., & Li, Y. (2022). Actionformer: Localizing moments of actions with transformers. In: European Conference on Computer Vision (pp. 492\u2013510). Springer.","DOI":"10.1007\/978-3-031-19772-7_29"},{"key":"2738_CR44","doi-asserted-by":"crossref","unstructured":"Zhu, Y., Zhang, G., Tan, J., Wu, G., & Wang, L. (2024). Dual DETRs for Multi-Label Temporal Action Detection. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (pp. 18559\u201318569).","DOI":"10.1109\/CVPR52733.2024.01756"}],"container-title":["International Journal of Computer Vision"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-026-02738-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11263-026-02738-x","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-026-02738-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,27]],"date-time":"2026-03-27T08:35:45Z","timestamp":1774600545000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11263-026-02738-x"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,2,21]]},"references-count":44,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2026,3]]}},"alternative-id":["2738"],"URL":"https:\/\/doi.org\/10.1007\/s11263-026-02738-x","relation":{},"ISSN":["0920-5691","1573-1405"],"issn-type":[{"value":"0920-5691","type":"print"},{"value":"1573-1405","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,2,21]]},"assertion":[{"value":"22 May 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"1 January 2026","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"21 February 2026","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"124"}}