{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,24]],"date-time":"2025-09-24T00:14:58Z","timestamp":1758672898156,"version":"3.44.0"},"publisher-location":"California","reference-count":0,"publisher":"International Joint Conferences on Artificial Intelligence Organization","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025,9]]},"abstract":"<jats:p>Category-level object pose estimation is a longstanding and fundamental task crucial for augmented reality and robotic manipulation applications. Existing RGB-based approaches struggle with multi-stage settings and heavily rely on off-the-shelf techniques, such as object detectors, depth estimators, non-differentiable NOCS shape alignment, etc. Extra dependencies lead to the accumulation of errors and complicate the whole pipeline, limiting the deployment of these approaches in practical applications. This paper streamlined an end-to-end framework unifying the single-frame and video-based category-level pose estimation. Specifically, instead of explicitly introducing extra dependencies, the DINOv2 encoder and depth decoder, as robust semantic and geometric prior extractors, are leveraged to produce intra-frame hierarchical semantic and geometric features. A spatial-temporal sparse query network is developed to model the implicit correspondence and inter-frame correlations between a set of implicit 3D query anchors and intra-frame features. Finally, a pose prediction head is employed using the bipartite matching algorithm. Experimental results demonstrate that our model achieves state-of-the-art performance compared with RGB-based categorical pose estimation methods on the REAL275 and CAMERA25 datasets. Our code is available at https:\/\/andrewchiyz.github.io\/vision.3dv.seqpose\/.<\/jats:p>","DOI":"10.24963\/ijcai.2025\/137","type":"proceedings-article","created":{"date-parts":[[2025,9,19]],"date-time":"2025-09-19T08:10:40Z","timestamp":1758269440000},"page":"1224-1232","source":"Crossref","is-referenced-by-count":0,"title":["SeqPose: An End-to-End Framework to Unify Single-frame and Video-based RGB Category-Level Pose Estimation"],"prefix":"10.24963","author":[{"given":"Yuzhu","family":"Ji","sequence":"first","affiliation":[{"name":"School of Computer Science and Technology, Guangdong University of Technology, China"}]},{"given":"Mingshan","family":"Sun","sequence":"additional","affiliation":[{"name":"CVTE Research, China"}]},{"given":"Jianyang","family":"Shi","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology, China"}]},{"given":"Xiaoke","family":"Jiang","sequence":"additional","affiliation":[{"name":"International Digital Economy Academy (IDEA), China"}]},{"given":"Yiqun","family":"Zhang","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Guangdong University of Technology, China"}]},{"given":"Haijun","family":"Zhang","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology, China"}]}],"member":"10584","event":{"number":"34","sponsor":["International Joint Conferences on Artificial Intelligence Organization (IJCAI)"],"acronym":"IJCAI-2025","name":"Thirty-Fourth International Joint Conference on Artificial Intelligence {IJCAI-25}","start":{"date-parts":[[2025,8,16]]},"theme":"Artificial Intelligence","location":"Montreal, Canada","end":{"date-parts":[[2025,8,22]]}},"container-title":["Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence"],"original-title":[],"deposited":{"date-parts":[[2025,9,23]],"date-time":"2025-09-23T11:33:06Z","timestamp":1758627186000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.ijcai.org\/proceedings\/2025\/137"}},"subtitle":[],"proceedings-subject":"Artificial Intelligence Research Articles","short-title":[],"issued":{"date-parts":[[2025,9]]},"references-count":0,"URL":"https:\/\/doi.org\/10.24963\/ijcai.2025\/137","relation":{},"subject":[],"published":{"date-parts":[[2025,9]]}}}