{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,30]],"date-time":"2026-06-30T15:52:25Z","timestamp":1782834745311,"version":"3.54.5"},"reference-count":25,"publisher":"Springer Science and Business Media LLC","issue":"10","license":[{"start":{"date-parts":[[2021,10,1]],"date-time":"2021-10-01T00:00:00Z","timestamp":1633046400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,10,16]],"date-time":"2021-10-16T00:00:00Z","timestamp":1634342400000},"content-version":"vor","delay-in-days":15,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100005713","name":"Technische Universit\u00e4t M\u00fcnchen","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100005713","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J CARS"],"published-print":{"date-parts":[[2021,10]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec>\n                <jats:title>Purpose<\/jats:title>\n                <jats:p>Surgical gesture recognition has been an essential task for providing intraoperative context-aware assistance and scheduling clinical resources. However, previous methods present limitations in catching long-range temporal information, and many of them require additional sensors. To address these challenges, we propose a symmetric dilated network, namely <jats:bold>SD-Net<\/jats:bold>, to jointly recognize surgical gestures and assess surgical skill levels only using RGB surgical video sequences.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Methods<\/jats:title>\n                <jats:p>We utilize symmetric 1D temporal dilated convolution layers to hierarchically capture gesture clues under different receptive fields such that features in different time span can be aggregated. In addition, a self-attention network is bridged in the middle to calculate the global frame-to-frame relativity.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Results<\/jats:title>\n                <jats:p>We evaluate our method on a robotic suturing task from the JIGSAWS dataset. The gesture recognition task largely outperforms the state of the arts on the frame-wise accuracy up to <jats:inline-formula><jats:alternatives><jats:tex-math>$$\\sim $$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                    <mml:mo>\u223c<\/mml:mo>\n                  <\/mml:math><\/jats:alternatives><\/jats:inline-formula><jats:bold>6<\/jats:bold> points and the F1@50 score <jats:inline-formula><jats:alternatives><jats:tex-math>$$\\sim $$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                    <mml:mo>\u223c<\/mml:mo>\n                  <\/mml:math><\/jats:alternatives><\/jats:inline-formula><jats:bold>8<\/jats:bold> points. We also keep the 100% predicted accuracy for the skill assessment task using LOSO validation scheme.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Conclusion<\/jats:title>\n                <jats:p>The results indicate that our architecture is able to obtain representative surgical video features by extensively considering the spatial, temporal and relational context from raw video input. Furthermore, the better performance in multi-task learning implies that surgical skill assessment has a complementary effects to gesture recognition task.<\/jats:p>\n              <\/jats:sec>","DOI":"10.1007\/s11548-021-02495-x","type":"journal-article","created":{"date-parts":[[2021,10,16]],"date-time":"2021-10-16T20:55:28Z","timestamp":1634417728000},"page":"1675-1682","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":18,"title":["SD-Net: joint surgical gesture recognition and skill assessment"],"prefix":"10.1007","volume":"16","author":[{"given":"Jinglu","family":"Zhang","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7023-6797","authenticated-orcid":false,"given":"Yinyu","family":"Nie","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yao","family":"Lyu","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Xiaosong","family":"Yang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jian","family":"Chang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jian Jun","family":"Zhang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2021,10,16]]},"reference":[{"key":"2495_CR1","doi-asserted-by":"crossref","unstructured":"van Amsterdam B, Clarkson MJ, Stoyanov D (2020) Multi-task recurrent neural network for surgical gesture recognition and progress prediction. arXiv preprint arXiv:2003.04772","DOI":"10.1109\/ICRA40945.2020.9197301"},{"key":"2495_CR2","doi-asserted-by":"crossref","unstructured":"Cornia M, Stefanini M, Baraldi L, Cucchiara R (2020) Meshed-memory transformer for image captioning. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 10578\u201310587","DOI":"10.1109\/CVPR42600.2020.01059"},{"key":"2495_CR3","doi-asserted-by":"crossref","unstructured":"DiPietro R, Lea C, Malpani A, Ahmidi N, Vedula SS, Lee GI, Lee MR, Hager GD (2016) Recognizing surgical activities with recurrent neural networks. In: International conference on medical image computing and computer-assisted intervention, Springer, pp 551\u2013558","DOI":"10.1007\/978-3-319-46720-7_64"},{"key":"2495_CR4","doi-asserted-by":"crossref","unstructured":"Farha YA, Gall J (2019) Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3575\u20133584","DOI":"10.1109\/CVPR.2019.00369"},{"key":"2495_CR5","doi-asserted-by":"crossref","unstructured":"Fawaz HI, Forestier G, Weber J, Idoumghar L, Muller PA (2018) Evaluating surgical skills from kinematic data using convolutional neural networks. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, pp 214\u2013221","DOI":"10.1007\/978-3-030-00937-3_25"},{"key":"2495_CR6","doi-asserted-by":"crossref","unstructured":"Funke I, Bodenstedt S, Oehme F, von Bechtolsheim F, Weitz J, Speidel S (2019a) Using 3d convolutional neural networks to learn spatiotemporal features for automatic surgical gesture recognition in video. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, pp 467\u2013475","DOI":"10.1007\/978-3-030-32254-0_52"},{"issue":"7","key":"2495_CR7","doi-asserted-by":"publisher","first-page":"1217","DOI":"10.1007\/s11548-019-01995-1","volume":"14","author":"I Funke","year":"2019","unstructured":"Funke I, Mees ST, Weitz J, Speidel S (2019b) Video-based surgical skill assessment using 3d convolutional neural networks. Int J Comput Assist Radiol Surg 14(7):1217\u20131225","journal-title":"Int J Comput Assist Radiol Surg"},{"key":"2495_CR8","unstructured":"Gao Y, Vedula SS, Reiley CE, Ahmidi N, Varadarajan B, Lin HC, Tao L, Zappella L, B\u00e9jar B, Yuh DD, Chen CC, Vidal R, Khudanpur S, Hager GD (2014) Jhu-isi gesture and skill assessment working set (jigsaws): A surgical activity dataset for human motion modeling. In: MICCAI Workshop: M2CAI, vol\u00a03, p\u00a03"},{"key":"2495_CR9","doi-asserted-by":"crossref","unstructured":"Hu H, Gu J, Zhang Z, Dai J, Wei Y (2018) Relation networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3588\u20133597","DOI":"10.1109\/CVPR.2018.00378"},{"key":"2495_CR10","doi-asserted-by":"crossref","unstructured":"Lea C, Hager GD, Vidal R (2015) An improved model for segmentation and recognition of fine-grained activities with application to surgical training tasks. In: 2015 IEEE winter conference on applications of computer vision, IEEE, pp 1123\u20131129","DOI":"10.1109\/WACV.2015.154"},{"key":"2495_CR11","doi-asserted-by":"crossref","unstructured":"Lea C, Reiter A, Vidal R, Hager GD (2016) Segmental spatiotemporal cnns for fine-grained action segmentation. In: European Conference on Computer Vision, Springer, pp 36\u201352","DOI":"10.1007\/978-3-319-46487-9_3"},{"key":"2495_CR12","doi-asserted-by":"crossref","unstructured":"Lea C, Flynn MD, Vidal R, Reiter A, Hager GD (2017) Temporal convolutional networks for action segmentation and detection. In: proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 156\u2013165","DOI":"10.1109\/CVPR.2017.113"},{"issue":"3","key":"2495_CR13","first-page":"1","volume":"1","author":"X Li","year":"2017","unstructured":"Li X, Zhang Y, Zhang J, Zhou M, Chen S, Gu Y, Chen Y, Marsic I, Farneth RA, Burd RS (2017) Progress estimation and phase detection for sequential processes. Proceed ACM Interactive, Mobile, Wearable Ubiquitous Technol 1(3):1\u201320","journal-title":"Proceed ACM Interactive, Mobile, Wearable Ubiquitous Technol"},{"key":"2495_CR14","doi-asserted-by":"crossref","unstructured":"Liu D, Jiang T (2018) Deep reinforcement learning for surgical gesture segmentation and classification. In: International conference on medical image computing and computer-assisted intervention, Springer, pp 247\u2013255","DOI":"10.1007\/978-3-030-00937-3_29"},{"key":"2495_CR15","doi-asserted-by":"crossref","unstructured":"Maier-Hein L, Vedula SS, Speidel S, Navab N, Kikinis R, Park A, Eisenmann M, Feussner H, Forestier G, Giannarou S, Hashizume M, Katic D, Kenngott H, Kranzfelder M, Malpani A, M\u00e3rz K, Neumuth T, Padoy N, Pugh C, Schoch N, Stoyanov D, Taylor R, Wagne M, Hager GD, Jannin P (2017) Surgical data science for next-generation interventions. Nature Biomed Eng 1(9):691\u2013696","DOI":"10.1038\/s41551-017-0132-7"},{"key":"2495_CR16","doi-asserted-by":"crossref","unstructured":"Mavroudi E, Bhaskara D, Sefati S, Ali H, Vidal R (2018) End-to-end fine-grained action segmentation and recognition using conditional random field models and discriminative sparse coding. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, pp 1558\u20131567","DOI":"10.1109\/WACV.2018.00174"},{"key":"2495_CR17","unstructured":"Pascanu R, Mikolov T, Bengio Y (2013) On the difficulty of training recurrent neural networks. In: International Conference on Machine Learning, pp 1310\u20131318"},{"key":"2495_CR18","doi-asserted-by":"crossref","unstructured":"Singh B, Marks TK, Jones M, Tuzel O, Shao M (2016) A multi-stream bi-directional recurrent neural network for fine-grained action detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1961\u20131970","DOI":"10.1109\/CVPR.2016.216"},{"key":"2495_CR19","doi-asserted-by":"crossref","unstructured":"Tao L, Zappella L, Hager GD, Vidal R (2013) Surgical gesture segmentation and recognition. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, pp 339\u2013346","DOI":"10.1007\/978-3-642-40760-4_43"},{"issue":"1","key":"2495_CR20","doi-asserted-by":"publisher","first-page":"86","DOI":"10.1109\/TMI.2016.2593957","volume":"36","author":"AP Twinanda","year":"2016","unstructured":"Twinanda AP, Shehata S, Mutter D, Marescaux J, De Mathelin M, Padoy N (2016) Endonet: a deep architecture for recognition tasks on laparoscopic videos. IEEE Trans Med Imaging 36(1):86\u201397","journal-title":"IEEE Trans Med Imaging"},{"key":"2495_CR21","unstructured":"Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser \u0141, Polosukhin I (2017) Attention is all you need. In: Advances in Neural Information Processing Systems, pp 5998\u20136008"},{"key":"2495_CR22","doi-asserted-by":"crossref","unstructured":"Wang T, Wang Y, Li M (2020) Towards accurate and interpretable surgical skill assessment: A video-based method incorporating recognized surgical gestures and skill levels. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, pp 668\u2013678","DOI":"10.1007\/978-3-030-59716-0_64"},{"key":"2495_CR23","doi-asserted-by":"crossref","unstructured":"Wang X, Girshick R, Gupta A, He K (2018) Non-local neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 7794\u20137803","DOI":"10.1109\/CVPR.2018.00813"},{"key":"2495_CR24","doi-asserted-by":"crossref","unstructured":"Zhang J, Nie Y, Lyu Y, Li H, Chang J, Yang X, Zhang JJ (2020a) Symmetric dilated convolution for surgical gesture recognition. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, pp 409\u2013418","DOI":"10.1007\/978-3-030-59716-0_39"},{"key":"2495_CR25","unstructured":"Zhang S, Guo S, Huang W, Scott MR, Wang L (2020b) V4d: 4d convolutional neural networks for video-level representation learning. arXiv preprint arXiv:2002.07442"}],"container-title":["International Journal of Computer Assisted Radiology and Surgery"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11548-021-02495-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11548-021-02495-x\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11548-021-02495-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,11,11]],"date-time":"2021-11-11T03:19:53Z","timestamp":1636600793000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11548-021-02495-x"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,10]]},"references-count":25,"journal-issue":{"issue":"10","published-print":{"date-parts":[[2021,10]]}},"alternative-id":["2495"],"URL":"https:\/\/doi.org\/10.1007\/s11548-021-02495-x","relation":{},"ISSN":["1861-6410","1861-6429"],"issn-type":[{"value":"1861-6410","type":"print"},{"value":"1861-6429","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,10]]},"assertion":[{"value":"8 February 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"7 September 2021","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"16 October 2021","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}},{"value":"This article does not contain any studies with human participants or animals performed by any of the authors.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethical approval"}},{"value":"This articles does not contain patient data.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Informed consent"}}]}}