{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,26]],"date-time":"2026-02-26T15:19:38Z","timestamp":1772119178830,"version":"3.50.1"},"reference-count":48,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2024,12,13]],"date-time":"2024-12-13T00:00:00Z","timestamp":1734048000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,12,13]],"date-time":"2024-12-13T00:00:00Z","timestamp":1734048000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Technische Hochschule N\u00fcrnberg"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Machine Vision and Applications"],"published-print":{"date-parts":[[2025,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Humans are still indispensable on industrial assembly lines, but in the event of an error, they need support from intelligent systems. In addition to the objects to be observed, it is equally important to understand the fine-grained hand movements of a human to be able to track the entire process. However, these deep-learning-based hand action recognition methods are very label intensive, which cannot be offered by all industrial companies due to the associated costs. This work therefore presents a self-supervised learning approach for industrial assembly processes that allows a spatio-temporal transformer architecture to be pre-trained on a variety of information from real-world video footage of daily life. Subsequently, this deep learning model is adapted to the industrial assembly task at hand using only a few labels. 
Well-known real-world datasets best suited for representation learning of such hand actions in a regression task are outlined, along with the extent to which they optimize the subsequently supervised-trained classification task. This fine-tuning is supplemented by concept drift detection, which makes the resulting productively deployed models more robust against concept drift and future changes in assembly movements.<\/jats:p>","DOI":"10.1007\/s00138-024-01638-9","type":"journal-article","created":{"date-parts":[[2024,12,12]],"date-time":"2024-12-12T22:51:26Z","timestamp":1734043886000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Self-supervised representation learning for robust fine-grained human hand action recognition in industrial assembly lines"],"prefix":"10.1007","volume":"36","author":[{"given":"Fabian","family":"Sturm","sequence":"first","affiliation":[]},{"given":"Martin","family":"Trat","sequence":"additional","affiliation":[]},{"given":"Rahul","family":"Sathiyababu","sequence":"additional","affiliation":[]},{"given":"Harshitha","family":"Allipilli","sequence":"additional","affiliation":[]},{"given":"Benjamin","family":"Menz","sequence":"additional","affiliation":[]},{"given":"Elke","family":"Hergenroether","sequence":"additional","affiliation":[]},{"given":"Melanie","family":"Siegel","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,12,13]]},"reference":[{"key":"1638_CR1","doi-asserted-by":"crossref","unstructured":"Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning. In: Proceedings of the 26th annual international conference on machine learning. pp. 41\u201348 (2009)","DOI":"10.1145\/1553374.1553380"},{"key":"1638_CR2","doi-asserted-by":"crossref","unstructured":"Bifet, A., Gavalda, R.: Learning from time-changing data with adaptive windowing. 
In: Proceedings of the 2007 SIAM international conference on data mining. pp. 443\u2013448. SIAM (2007)","DOI":"10.1137\/1.9781611972771.42"},{"key":"1638_CR3","unstructured":"Cao, S., Xu, P., Clifton, D.A.: How to understand masked autoencoders. arXiv preprint arXiv:2202.03670 (2022)"},{"key":"1638_CR4","doi-asserted-by":"publisher","unstructured":"Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein, J., Doran, C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers). pp. 4171\u20134186. Association for Computational Linguistics (2019). https:\/\/doi.org\/10.18653\/v1\/n19-1423","DOI":"10.18653\/v1\/n19-1423"},{"key":"1638_CR5","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: transformers for image recognition at scale (2021)"},{"key":"1638_CR6","first-page":"35946","volume":"35","author":"C Feichtenhofer","year":"2022","unstructured":"Feichtenhofer, C., Li, Y., He, K., et al.: Masked autoencoders as spatiotemporal learners. Adv. Neural. Inf. Process. Syst. 35, 35946\u201335958 (2022)","journal-title":"Adv. Neural. Inf. Process. Syst."},{"issue":"4","key":"1638_CR7","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/2523813","volume":"46","author":"J Gama","year":"2014","unstructured":"Gama, J., Zliobait\u0117, I., Bifet, A., Pechenizkiy, M., Bouchachia, A.: A survey on concept drift adaptation. ACM Comput. Surv. 46(4), 1\u201337 (2014). https:\/\/doi.org\/10.1145\/2523813","journal-title":"ACM Comput. 
Surv."},{"issue":"4","key":"1638_CR8","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/2523813","volume":"46","author":"J Gama","year":"2014","unstructured":"Gama, J., \u017dliobait\u0117, I., Bifet, A., Pechenizkiy, M., Bouchachia, A.: A survey on concept drift adaptation. ACM Comput. Surv. (CSUR) 46(4), 1\u201337 (2014)","journal-title":"ACM Comput. Surv. (CSUR)"},{"key":"1638_CR9","doi-asserted-by":"crossref","unstructured":"Goyal, R., Kahou, S.E., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., Hoppe, F., Thurau, C., Bax, I., Memisevic, R.: The \u201csomething something\u201d video database for learning and evaluating visual common sense (2017). arXiv:1706.04261","DOI":"10.1109\/ICCV.2017.622"},{"key":"1638_CR10","doi-asserted-by":"crossref","unstructured":"He, K., Chen, X., Xie, S., Li, Y., Doll\u00e1r, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition. pp. 16000\u201316009 (2022)","DOI":"10.1109\/CVPR52688.2022.01553"},{"key":"1638_CR11","unstructured":"Hendrycks, D., Gimpel, K.: Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415 (2016)"},{"issue":"7","key":"1638_CR12","doi-asserted-by":"publisher","first-page":"1527","DOI":"10.1162\/neco.2006.18.7.1527","volume":"18","author":"GE Hinton","year":"2006","unstructured":"Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Comput. 18(7), 1527\u20131554 (2006)","journal-title":"Neural Comput."},{"key":"1638_CR13","doi-asserted-by":"publisher","first-page":"4806","DOI":"10.1109\/ACCESS.2019.2962617","volume":"8","author":"Y Ho","year":"2019","unstructured":"Ho, Y., Wookey, S.: The real-world-weight cross-entropy loss function: Modeling the costs of mislabeling. 
IEEE Access 8, 4806\u20134813 (2019)","journal-title":"IEEE Access"},{"key":"1638_CR14","unstructured":"Hu, M., Kapoor, B., Akella, P., Prager, D.: The state of human factory analytics (2018), https:\/\/info.kearney.com\/30\/2769\/uploads\/the-state-of-human-factory-analytics.pdf?intIaContactId=eAsAAnVQ4FJww4J%2fWxZkpg%3d%3d&intExternalSystemId=1, accessed: 07\/25\/2024"},{"key":"1638_CR15","unstructured":"Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Bach, F.R., Blei, D.M. (eds.) Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015. JMLR Workshop and Conference Proceedings, vol.\u00a037, pp. 448\u2013456. JMLR.org (2015), http:\/\/proceedings.mlr.press\/v37\/ioffe15.html"},{"key":"1638_CR16","doi-asserted-by":"publisher","first-page":"1532","DOI":"10.1109\/ACCESS.2018.2886026","volume":"7","author":"AS Iwashita","year":"2019","unstructured":"Iwashita, A.S., Papa, J.P.: An Overview on Concept Drift Learning. IEEE Access 7, 1532\u20131547 (2019). https:\/\/doi.org\/10.1109\/ACCESS.2018.2886026","journal-title":"IEEE Access"},{"issue":"1","key":"1638_CR17","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/s12530-016-9168-2","volume":"9","author":"I Khamassi","year":"2018","unstructured":"Khamassi, I., Sayed-Mouchaweh, M., Hammami, M., Gh\u00e9dira, K.: Discussion and review on evolving data streams and concept drift adapting. Evol. Syst. 9(1), 1\u201323 (2018). https:\/\/doi.org\/10.1007\/s12530-016-9168-2","journal-title":"Evol. Syst."},{"key":"1638_CR18","doi-asserted-by":"publisher","first-page":"132","DOI":"10.1016\/j.inffus.2017.02.004","volume":"37","author":"B Krawczyk","year":"2017","unstructured":"Krawczyk, B., Minku, L.L., Gama, J., Stefanowski, J., Wo\u017aniak, M.: Ensemble learning for data stream analysis: a survey. Inf. Fusion 37, 132\u2013156 (2017). 
https:\/\/doi.org\/10.1016\/j.inffus.2017.02.004","journal-title":"Inf. Fusion"},{"key":"1638_CR19","unstructured":"Li, Y., Si, S., Li, G., Hsieh, C.J., Bengio, S.: Learnable fourier features for multi-dimensional spatial positional encoding (2021)"},{"key":"1638_CR20","unstructured":"Li, Y., Liu, M., Rehg, J.M.: In the eye of the beholder: Gaze and actions in first person video (2020). arxiv:2006.00626"},{"key":"1638_CR21","unstructured":"Lin, T., Doll\u00e1r, P., Girshick, R.B., He, K., Hariharan, B., Belongie, S.J.: Feature pyramid networks for object detection. CoRR abs\/1612.03144 (2016), arxiv:1612.03144"},{"key":"1638_CR22","doi-asserted-by":"crossref","unstructured":"Lin, T., Goyal, P., Girshick, R.B., He, K., Doll\u00e1r, P.: Focal loss for dense object detection. CoRR abs\/1708.02002 (2017), arxiv:1708.02002","DOI":"10.1109\/ICCV.2017.324"},{"key":"1638_CR23","unstructured":"Liu, M., Ren, S., Ma, S., Jiao, J., Chen, Y., Wang, Z., Song, W.: Gated transformer networks for multivariate time series classification. CoRR abs\/2103.14438 (2021), arxiv:2103.14438"},{"key":"1638_CR24","unstructured":"Mahdisoltani, F., Berger, G., Gharbieh, W., Fleet, D.J., Memisevic, R.: Fine-grained video classification and captioning. CoRR abs\/1804.09235 (2018), arxiv:1804.09235"},{"issue":"4","key":"1638_CR25","doi-asserted-by":"publisher","first-page":"619","DOI":"10.1109\/TKDE.2011.58","volume":"24","author":"LL Minku","year":"2012","unstructured":"Minku, L.L., Yao, X.: DDD: a new ensemble approach for dealing with concept drift. IEEE Trans. Knowl. Data Eng. 24(4), 619\u2013633 (2012). https:\/\/doi.org\/10.1109\/TKDE.2011.58","journal-title":"IEEE Trans. Knowl. 
Data Eng."},{"key":"1638_CR26","unstructured":"Ng, A.: Sparse autoencoder (NA), http:\/\/www.stanford.edu\/class\/cs294a\/sparseAutoencoder.pdf"},{"issue":"1\/2","key":"1638_CR27","doi-asserted-by":"publisher","first-page":"100","DOI":"10.2307\/2333009","volume":"41","author":"ES Page","year":"1954","unstructured":"Page, E.S.: Continuous inspection schemes. Biometrika 41(1\/2), 100\u2013115 (1954)","journal-title":"Biometrika"},{"key":"1638_CR28","doi-asserted-by":"crossref","unstructured":"Sebasti\u00e3o, R., Fernandes, J.M.: Supporting the page-hinkley test with empirical mode decomposition for change detection. In: International Symposium on Methodologies for Intelligent Systems. pp. 492\u2013498. Springer (2017)","DOI":"10.1007\/978-3-319-60438-1_48"},{"key":"1638_CR29","doi-asserted-by":"publisher","first-page":"103","DOI":"10.1016\/j.procs.2015.07.284","volume":"53","author":"TS Sethi","year":"2015","unstructured":"Sethi, T.S., Kantardzic, M.: Don\u2019t pay for validation: detecting drifts from unlabeled data using margin density. Procedia Comput. Sci. 53, 103\u2013112 (2015). https:\/\/doi.org\/10.1016\/j.procs.2015.07.284","journal-title":"Procedia Comput. Sci."},{"key":"1638_CR30","doi-asserted-by":"publisher","first-page":"1079","DOI":"10.1007\/978-3-031-37717-4_70","volume-title":"Intelligent Computing","author":"F Sturm","year":"2023","unstructured":"Sturm, F., Hergenroether, E., Reinhardt, J., Vojnovikj, P.S., Siegel, M.: Challenges of the creation of a dataset for vision based human hand action recognition in industrial assembly. In: Arai, K. (ed.) Intelligent Computing, pp. 1079\u20131098. Springer Nature Switzerland, Cham (2023)"},{"key":"1638_CR31","doi-asserted-by":"crossref","unstructured":"Sturm, F., Sathiyababu, R., Allipilli, H., Hergenroether, E., Siegel, M.: Self-supervised representation learning for fine grained human hand action recognition in industrial assembly lines. In: International Symposium on Visual Computing. pp. 172\u2013184. 
Springer (2023)","DOI":"10.1007\/978-3-031-47969-4_14"},{"key":"1638_CR32","doi-asserted-by":"crossref","unstructured":"Tang, P., Zhang, X.: Mtsmae: Masked autoencoders for multivariate time-series forecasting. In: 2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI). pp. 982\u2013989. IEEE (2022)","DOI":"10.1109\/ICTAI56018.2022.00150"},{"key":"1638_CR33","first-page":"10078","volume":"35","author":"Z Tong","year":"2022","unstructured":"Tong, Z., Song, Y., Wang, J., Wang, L.: Videomae: masked autoencoders are data-efficient learners for self-supervised video pre-training. Adv. Neural. Inf. Process. Syst. 35, 10078\u201310093 (2022)","journal-title":"Adv. Neural. Inf. Process. Syst."},{"key":"1638_CR34","unstructured":"Trockman, A., Kolter, J.Z.: Patches are all you need? Trans. Mach. Learn. Res. 2023 (2022)"},{"key":"1638_CR35","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., Polosukhin, I.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)"},{"key":"1638_CR36","doi-asserted-by":"publisher","DOI":"10.1038\/s41598-022-15245-z","author":"D Vela","year":"2022","unstructured":"Vela, D., Sharp, A., Zhang, R., Nguyen, T., Hoang, A., Pianykh, O.S.: Temporal quality degradation in AI models. Sci. Rep. (2022). https:\/\/doi.org\/10.1038\/s41598-022-15245-z","journal-title":"Sci. Rep."},{"key":"1638_CR37","doi-asserted-by":"publisher","unstructured":"Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th International Conference on Machine Learning. pp. 1096\u20131103 (01 2008). https:\/\/doi.org\/10.1145\/1390156.1390294","DOI":"10.1145\/1390156.1390294"},{"key":"1638_CR38","unstructured":"Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A.: Learning useful representations in a deep network with a local denoising criterion. 
Journal of Machine Learning Research, 11 (Dec) pp. 3371\u20133408 (2010)"},{"issue":"12","key":"1638_CR39","first-page":"3371","volume":"11","author":"P Vincent","year":"2010","unstructured":"Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A., Bottou, L.: Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11(12), 3371\u20133408 (2010)","journal-title":"J. Mach. Learn. Res."},{"key":"1638_CR40","doi-asserted-by":"crossref","unstructured":"Vondrick, C., Pirsiavash, H., Torralba, A.: Anticipating visual representations from unlabeled video. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 98\u2013106 (2016)","DOI":"10.1109\/CVPR.2016.18"},{"key":"1638_CR41","doi-asserted-by":"crossref","unstructured":"Vondrick, C., Shrivastava, A., Fathi, A., Guadarrama, S., Murphy, K.: Tracking emerges by colorizing videos. In: Proceedings of the European conference on computer vision (ECCV). pp. 391\u2013408 (2018)","DOI":"10.1007\/978-3-030-01261-8_24"},{"issue":"11","key":"1638_CR42","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/s42452-019-1433-0","volume":"1","author":"S Wares","year":"2019","unstructured":"Wares, S., Isaacs, J., Elyan, E.: Data stream mining: methods and challenges for handling concept drift. SN Appl. Sci. 1(11), 1\u201319 (2019). https:\/\/doi.org\/10.1007\/s42452-019-1433-0","journal-title":"SN Appl. Sci."},{"issue":"4","key":"1638_CR43","doi-asserted-by":"publisher","first-page":"964","DOI":"10.1007\/s10618-015-0448-4","volume":"30","author":"GI Webb","year":"2016","unstructured":"Webb, G.I., Hyde, R., Cao, H., Nguyen, H.L., Petitjean, F.: Characterizing concept drift. Data Min. Knowl. Disc. 30(4), 964\u2013994 (2016). https:\/\/doi.org\/10.1007\/s10618-015-0448-4","journal-title":"Data Min. Knowl. 
Disc."},{"key":"1638_CR44","doi-asserted-by":"crossref","unstructured":"Wu, W., Hua, Y., Wu, S., Chen, C., Lu, A., et al.: Skeletonmae: Spatial-temporal masked autoencoders for self-supervised skeleton action recognition. arXiv preprint arXiv:2209.02399 (2022)","DOI":"10.1109\/ICMEW59549.2023.00045"},{"key":"1638_CR45","doi-asserted-by":"crossref","unstructured":"Xie, Z., Zhang, Z., Cao, Y., Lin, Y., Bao, J., Yao, Z., Dai, Q., Hu, H.: Simmim: a simple framework for masked image modeling. 2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 9643\u20139653 (2021)","DOI":"10.1109\/CVPR52688.2022.00943"},{"key":"1638_CR46","doi-asserted-by":"crossref","unstructured":"Zerveas, G., Jayaraman, S., Patel, D., Bhamidipaty, A., Eickhoff, C.: A transformer-based framework for multivariate time series representation learning. In: Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining. pp. 2114\u20132124 (2021)","DOI":"10.1145\/3447548.3467401"},{"key":"1638_CR47","unstructured":"Zhang, F., Bazarevsky, V., Vakunov, A., Tkachenka, A., Sung, G., Chang, C., Grundmann, M.: Mediapipe hands: On-device real-time hand tracking. CoRR abs\/2006.10214 (2020), arxiv:2006.10214"},{"key":"1638_CR48","doi-asserted-by":"crossref","unstructured":"\u017dliobait\u0117, I., Pechenizkiy, M., Gama, J.: An overview of concept drift applications. Big data analysis: new algorithms for a new society pp. 
91\u2013114 (2016)","DOI":"10.1007\/978-3-319-26989-4_4"}],"container-title":["Machine Vision and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00138-024-01638-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s00138-024-01638-9\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00138-024-01638-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,1,18]],"date-time":"2025-01-18T05:29:41Z","timestamp":1737178181000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s00138-024-01638-9"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,12,13]]},"references-count":48,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,1]]}},"alternative-id":["1638"],"URL":"https:\/\/doi.org\/10.1007\/s00138-024-01638-9","relation":{"has-preprint":[{"id-type":"doi","id":"10.21203\/rs.3.rs-4347681\/v1","asserted-by":"object"}]},"ISSN":["0932-8092","1432-1769"],"issn-type":[{"value":"0932-8092","type":"print"},{"value":"1432-1769","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,12,13]]},"assertion":[{"value":"30 April 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"25 July 2024","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"19 November 2024","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"13 December 2024","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article 
History"}}],"article-number":"19"}}