{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,22]],"date-time":"2026-04-22T12:20:16Z","timestamp":1776860416740,"version":"3.51.2"},"reference-count":49,"publisher":"Springer Science and Business Media LLC","issue":"12","license":[{"start":{"date-parts":[[2024,7,16]],"date-time":"2024-07-16T00:00:00Z","timestamp":1721088000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,7,16]],"date-time":"2024-07-16T00:00:00Z","timestamp":1721088000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62176020"],"award-info":[{"award-number":["62176020"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100012165","name":"Key Technologies Research and Development Program","doi-asserted-by":"publisher","award":["2020AAA0106800"],"award-info":[{"award-number":["2020AAA0106800"]}],"id":[{"id":"10.13039\/501100012165","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100004826","name":"Natural Science Foundation of Beijing Municipality","doi-asserted-by":"publisher","award":["Z180006"],"award-info":[{"award-number":["Z180006"]}],"id":[{"id":"10.13039\/501100004826","id-type":"DOI","asserted-by":"publisher"}]},{"name":"CAAI-Huawei MindSpore Open Fund and Chinese Academy of Sciences","award":["OEIP-O-202004"],"award-info":[{"award-number":["OEIP-O-202004"]}]},{"DOI":"10.13039\/501100015342","name":"Key Laboratory of Road Traffic Safety Ministry of Public Security","doi-asserted-by":"publisher","award":["RCS2023K006"],"award-info":[{"award-number":["RCS2023K006"]}],"id":[{"id":"10.13039\/501100015342","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Comput Vis"],"published-print":{"date-parts":[[2024,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>We propose a unified object-aware temporal learning framework for multi-view 3D detection and tracking tasks. Having observed that the efficacy of the temporal fusion strategy in recent multi-view perception methods may be weakened by distractors and background clutters in historical frames, we propose a cyclic learning mechanism to improve the robustness of multi-view representation learning. The essence is constructing a backward bridge to propagate information from model predictions (<jats:italic>e.g.,<\/jats:italic> object locations and sizes) to image and BEV features, which forms a circle with regular inference. After backward refinement, the responses of target-irrelevant regions in historical frames would be suppressed, decreasing the risk of polluting future frames and improving the object awareness ability of temporal fusion. We further tailor an object-aware association strategy for tracking based on the cyclic learning model. The cyclic learning model not only provides refined features, but also delivers finer clues (<jats:italic>e.g.,<\/jats:italic> scale level) for tracklet association. The proposed cycle learning method and association module together contribute a novel and unified multi-task framework. Experiments on nuScenes show that the proposed model achieves consistent performance gains over baselines of different designs (<jats:italic>i.e.,<\/jats:italic> dense query-based BEVFormer, sparse query-based SparseBEV and LSS-based BEVDet4D) on both detection and tracking evaluation. Codes and models will be released.<\/jats:p>","DOI":"10.1007\/s11263-024-02176-7","type":"journal-article","created":{"date-parts":[[2024,7,16]],"date-time":"2024-07-16T06:01:51Z","timestamp":1721109711000},"page":"6184-6206","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Cyclic Refiner: Object-Aware Temporal Representation Learning for Multi-view 3D Detection and Tracking"],"prefix":"10.1007","volume":"132","author":[{"given":"Mingzhe","family":"Guo","sequence":"first","affiliation":[]},{"given":"Zhipeng","family":"Zhang","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7578-3407","authenticated-orcid":false,"given":"Liping","family":"Jing","sequence":"additional","affiliation":[]},{"given":"Yuan","family":"He","sequence":"additional","affiliation":[]},{"given":"Ke","family":"Wang","sequence":"additional","affiliation":[]},{"given":"Heng","family":"Fan","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,7,16]]},"reference":[{"key":"2176_CR1","doi-asserted-by":"publisher","DOI":"10.4324\/9781315802084","volume-title":"Philosophy of mind: An overview for cognitive science","author":"W Bechtel","year":"2013","unstructured":"Bechtel, W. (2013). Philosophy of mind: An overview for cognitive science. London: Psychology Press."},{"key":"2176_CR2","doi-asserted-by":"crossref","unstructured":"Bhat, G., Danelljan, M., Gool, L.V., & Timofte, R. (2019). Learning discriminative model prediction for tracking. In: ICCV.","DOI":"10.1109\/ICCV.2019.00628"},{"key":"2176_CR3","doi-asserted-by":"crossref","unstructured":"Bolme, D.S., Beveridge, J.R., Draper, B.A., & Lui, Y.M. (2010). Visual object tracking using adaptive correlation filters. In: CVPR.","DOI":"10.1109\/CVPR.2010.5539960"},{"key":"2176_CR4","doi-asserted-by":"crossref","unstructured":"Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., & Beijbom, O. (2020). nuscenes: A multimodal dataset for autonomous driving. In: CVPR.","DOI":"10.1109\/CVPR42600.2020.01164"},{"key":"2176_CR5","doi-asserted-by":"crossref","unstructured":"Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In: ECCV.","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"2176_CR6","unstructured":"Chaabane, M., Zhang, P., Beveridge, J.R., & O\u2019Hara, S. (2021). Deft: Detection embeddings for tracking. arXiv."},{"key":"2176_CR7","doi-asserted-by":"crossref","unstructured":"Cui, Y., Jiang, C., Wang, L., & Wu, G. (2022). Fully convolutional online tracking. Computer Vision and Image Understanding.","DOI":"10.1016\/j.cviu.2022.103547"},{"key":"2176_CR8","doi-asserted-by":"crossref","unstructured":"Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., & Wei, Y. (2017). Deformable convolutional networks. In: ICCV.","DOI":"10.1109\/ICCV.2017.89"},{"key":"2176_CR9","doi-asserted-by":"crossref","unstructured":"Danelljan, M., Bhat, G., Khan, F.S., & Felsberg, M. (2019). Atom: Accurate tracking by overlap maximization. In: CVPR.","DOI":"10.1109\/CVPR.2019.00479"},{"key":"2176_CR10","unstructured":"Fischer, T., Yang, Y.H., Kumar, S., et\u00a0al. (2022). Cc-3dt: Panoramic 3d object tracking via cross-camera fusion. NeurIPS."},{"key":"2176_CR11","doi-asserted-by":"crossref","unstructured":"He, K., Gkioxari, G., Doll\u00e1r, P., & Girshick, R. (2017). Mask r-cnn. In: ICCV.","DOI":"10.1109\/ICCV.2017.322"},{"key":"2176_CR12","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In: CVPR.","DOI":"10.1109\/CVPR.2016.90"},{"key":"2176_CR13","doi-asserted-by":"crossref","unstructured":"Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"2176_CR14","doi-asserted-by":"crossref","unstructured":"Hu, H.N., Yang, Y.H., Fischer, T., Darrell, T., Yu, F., & Sun, M. (2022). Monocular quasi-dense 3d object tracking. TPAMI.","DOI":"10.1109\/TPAMI.2022.3168781"},{"key":"2176_CR15","unstructured":"Huang, B., Li, Y., Xie, E., Liang, F., Wang, L., Shen, M., Liu, F., Wang, T., Luo, P., & Shao, J. (2023). Fast-bev: Towards real-time on-vehicle bird\u2019s-eye view perception. arXiv."},{"key":"2176_CR16","unstructured":"Huang, J., & Huang, G. (2022). Bevdet4d: Exploit temporal cues in multi-camera 3d object detection. arXiv."},{"key":"2176_CR17","unstructured":"Huang, J., Huang, G., Zhu, Z., & Du, D. (2021). Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv."},{"key":"2176_CR18","doi-asserted-by":"crossref","unstructured":"Jiang, Y., Zhang, L., Miao, Z., Zhu, X., Gao, J., Hu, W., & Jiang, Y.G. (2022). Polarformer: Multi-camera 3d object detection with polar transformers. arXiv.","DOI":"10.1609\/aaai.v37i1.25185"},{"key":"2176_CR19","doi-asserted-by":"crossref","unstructured":"Kuhn, H.W. (1955). The hungarian method for the assignment problem. Naval research logistics quarterly.","DOI":"10.1002\/nav.3800020109"},{"key":"2176_CR20","doi-asserted-by":"crossref","unstructured":"Li, Y., Bao, H., Ge, Z., Yang, J., Sun, J., & Li, Z. (2023). Bevstereo: Enhancing depth estimation in multi-view 3d object detection with temporal stereo. In: AAAI.","DOI":"10.1609\/aaai.v37i2.25234"},{"key":"2176_CR21","unstructured":"Li, Y., Chen, Y., Qi, X., et\u00a0al. (2022). Unifying voxel-based representation with transformer for 3d object detection. arXiv."},{"key":"2176_CR22","doi-asserted-by":"crossref","unstructured":"Li, Y., Ge, Z., Yu, G., Yang, J., Wang, Z., Shi, Y., Sun, J., & Li, Z. (2022). Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. arXiv.","DOI":"10.1609\/aaai.v37i2.25233"},{"key":"2176_CR23","doi-asserted-by":"crossref","unstructured":"Li, Z., Wang, W., Li, H., Xie, E., Sima, C., Lu, T., Qiao, Y., & Dai, J. (2022). Bevformer: Learning bird\u2019s-eye-view representation from multi-camera images via spatiotemporal transformers. In: ECCV.","DOI":"10.1007\/978-3-031-20077-9_1"},{"key":"2176_CR24","doi-asserted-by":"crossref","unstructured":"Liang, C., Zhang, Z., Zhou, X., Li, B., & Hu, W. (2022). One more check: making \u201cfake background\u201d be tracked again. In: AAAI.","DOI":"10.1609\/aaai.v36i2.20045"},{"key":"2176_CR25","doi-asserted-by":"crossref","unstructured":"Liang, C., Zhang, Z., Zhou, X., Li, B., Zhu, S., & Hu, W. (2022). Rethinking the competition between detection and reid in multiobject tracking. TIP.","DOI":"10.1109\/TIP.2022.3165376"},{"key":"2176_CR26","doi-asserted-by":"crossref","unstructured":"Liu, H., Teng, Y., Lu, T., Wang, H., & Wang, L. (2023). Sparsebev: High-performance sparse 3d object detection from multi-camera videos. In: ICCV.","DOI":"10.1109\/ICCV51070.2023.01703"},{"key":"2176_CR27","doi-asserted-by":"crossref","unstructured":"Liu, Y., Wang, T., Zhang, X., & Sun, J. (2022). Petr: Position embedding transformation for multi-view 3d object detection. In: ECCV.","DOI":"10.1007\/978-3-031-19812-0_31"},{"key":"2176_CR28","doi-asserted-by":"crossref","unstructured":"Liu, Y., Yan, J., Jia, F., Li, S., Gao, Q., Wang, T., Zhang, X., & Sun, J. (2022). Petrv2: A unified framework for 3d perception from multi-camera images. arXiv.","DOI":"10.1109\/ICCV51070.2023.00302"},{"key":"2176_CR29","unstructured":"Pang, Z., Li, Z., & Wang, N. (2021). Simpletrack: Understanding and rethinking 3d multi-object tracking. arXiv."},{"key":"2176_CR30","unstructured":"Park, J., Xu, C., Yang, S., Keutzer, K., Kitani, K., Tomizuka, M., & Zhan, W. (2022). Time will tell: New outlooks and a baseline for temporal multi-view 3d object detection. arXiv."},{"key":"2176_CR31","doi-asserted-by":"crossref","unstructured":"Philion, J., & Fidler, S. (2020). Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In: ECCV.","DOI":"10.1007\/978-3-030-58568-6_12"},{"key":"2176_CR32","doi-asserted-by":"crossref","unstructured":"Price, C.J. (1998). The functional anatomy of word comprehension and production. Trends in cognitive sciences.","DOI":"10.1016\/S1364-6613(98)01201-7"},{"key":"2176_CR33","doi-asserted-by":"crossref","unstructured":"Reading, C., Harakeh, A., Chae, J., & Waslander, S.L. (2021). Categorical depth distribution network for monocular 3d object detection. In: CVPR.","DOI":"10.1109\/CVPR46437.2021.00845"},{"key":"2176_CR34","unstructured":"Ren, S., He, K., Girshick, R., et\u00a0al. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. NeurIPS."},{"key":"2176_CR35","unstructured":"Shi, Y., Shen, J., Sun, Y., Wang, Y., Li, J., Sun, S., Jiang, K., & Yang, D. (2022). Srcn3d: Sparse r-cnn 3d surround-view camera object detection and tracking for autonomous driving. arXiv."},{"key":"2176_CR36","unstructured":"Wang, T., Xinge, Z., Pang, J., & Lin, D. (2022). Probabilistic and geometric depth: Detecting objects in perspective. In: CORL."},{"key":"2176_CR37","doi-asserted-by":"crossref","unstructured":"Wang, T., Zhu, X., Pang, J., & Lin, D. (2021). Fcos3d: Fully convolutional one-stage monocular 3d object detection. In: ICCV.","DOI":"10.1109\/ICCVW54120.2021.00107"},{"key":"2176_CR38","doi-asserted-by":"crossref","unstructured":"Wang, Y., Chen, Y., & Zhang, Z. (2023). Frustumformer: Adaptive instance-aware resampling for multi-view 3d detection. In: CVPR.","DOI":"10.1109\/CVPR52729.2023.00493"},{"key":"2176_CR39","unstructured":"Wang, Y., Guizilini, V.C., Zhang, T., Wang, Y., Zhao, H., & Solomon, J. (2022). Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In: CORL."},{"key":"2176_CR40","doi-asserted-by":"crossref","unstructured":"Wang, Z., Huang, Z., Fu, J., Wang, N., & Liu, S. (2023). Object as query: Lifting any 2d object detector to 3d detection. In: ICCV.","DOI":"10.1109\/ICCV51070.2023.00351"},{"key":"2176_CR41","volume-title":"An introduction to the kalman filter","author":"G Welch","year":"1995","unstructured":"Welch, G., Bishop, G., et al. (1995). An introduction to the kalman filter. NC, USA: Chapel Hill."},{"key":"2176_CR42","unstructured":"Xie, E., Yu, Z., Zhou, D., et\u00a0al. (2022). M2bev: Multi-camera joint 3d detection and segmentation with unified birds-eye view representation. arXiv."},{"key":"2176_CR43","doi-asserted-by":"crossref","unstructured":"Yang, F., Odashima, S., Masui, S., & Jiang, S. (2023). Hard to track objects with irregular motions and similar appearances? make it easier by buffering the matching space. In: WACV.","DOI":"10.1109\/WACV56688.2023.00478"},{"key":"2176_CR44","doi-asserted-by":"crossref","unstructured":"Zhang, T., Chen, X., Wang, Y., et\u00a0al. (2022). Mutr3d: A multi-camera tracking framework via 3d-to-2d queries. In: CVPR.","DOI":"10.1109\/CVPRW56347.2022.00500"},{"key":"2176_CR45","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Peng, H., Fu, J., Li, B., & Hu, W. (2020). Ocean: Object-aware anchor-free tracking. In: ECCV.","DOI":"10.1007\/978-3-030-58589-1_46"},{"key":"2176_CR46","doi-asserted-by":"crossref","unstructured":"Zhou, H., Ge, Z., Li, Z., & Zhang, X. (2022). Matrixvt: Efficient multi-camera to bev transformation for 3d perception. arXiv.","DOI":"10.1109\/ICCV51070.2023.00785"},{"key":"2176_CR47","doi-asserted-by":"crossref","unstructured":"Zhou, X., Koltun, V., & Kr\u00e4henb\u00fchl, P. (2020). Tracking objects as points. In: ECCV.","DOI":"10.1007\/978-3-030-58548-8_28"},{"key":"2176_CR48","unstructured":"Zhu, B., Jiang, Z., Zhou, X., Li, Z., & Yu, G. (2019). Class-balanced grouping and sampling for point cloud 3d object detection. arXiv."},{"key":"2176_CR49","unstructured":"Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai, J. (2020). Deformable detr: Deformable transformers for end-to-end object detection. arXiv."}],"container-title":["International Journal of Computer Vision"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-024-02176-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11263-024-02176-7\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-024-02176-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,11,15]],"date-time":"2024-11-15T10:29:02Z","timestamp":1731666542000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11263-024-02176-7"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7,16]]},"references-count":49,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2024,12]]}},"alternative-id":["2176"],"URL":"https:\/\/doi.org\/10.1007\/s11263-024-02176-7","relation":{},"ISSN":["0920-5691","1573-1405"],"issn-type":[{"value":"0920-5691","type":"print"},{"value":"1573-1405","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,7,16]]},"assertion":[{"value":"31 August 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"3 July 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"16 July 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}