{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,30]],"date-time":"2026-04-30T16:47:30Z","timestamp":1777567650547,"version":"3.51.4"},"publisher-location":"Cham","reference-count":95,"publisher":"Springer Nature Switzerland","isbn-type":[{"value":"9783031730382","type":"print"},{"value":"9783031730399","type":"electronic"}],"license":[{"start":{"date-parts":[[2024,10,31]],"date-time":"2024-10-31T00:00:00Z","timestamp":1730332800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,10,31]],"date-time":"2024-10-31T00:00:00Z","timestamp":1730332800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Natural behavior is hierarchical. Yet, there is a paucity of benchmarks addressing this aspect. Recognizing the scarcity of large-scale hierarchical behavioral benchmarks, we create a novel synthetic basketball playing benchmark (Shot7M2). Beyond synthetic data, we extend BABEL into a hierarchical action segmentation benchmark (hBABEL). Then, we develop a masked autoencoder framework (hBehaveMAE) to elucidate the hierarchical nature of motion capture data in an unsupervised fashion. We find that hBehaveMAE learns interpretable latents on Shot7M2 and hBABEL, where lower encoder levels show a superior ability to represent fine-grained movements, while higher encoder levels capture complex actions and activities. Additionally, we evaluate hBehaveMAE on MABe22, a representation learning benchmark with short and long-term behavioral states. hBehaveMAE achieves state-of-the-art performance without domain-specific feature extraction. Together, these components synergistically contribute towards unveiling the hierarchical organization of natural behavior. Models and benchmarks are available at <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/amathislab\/BehaveMAE\">https:\/\/github.com\/amathislab\/BehaveMAE<\/jats:ext-link>.<\/jats:p>","DOI":"10.1007\/978-3-031-73039-9_7","type":"book-chapter","created":{"date-parts":[[2024,10,30]],"date-time":"2024-10-30T14:57:07Z","timestamp":1730300227000},"page":"106-125","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["Elucidating the\u00a0Hierarchical Nature of\u00a0Behavior with\u00a0Masked Autoencoders"],"prefix":"10.1007","author":[{"ORCID":"https:\/\/orcid.org\/0009-0002-2147-3096","authenticated-orcid":false,"given":"Lucas","family":"Stoffl","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0009-0000-9006-7177","authenticated-orcid":false,"given":"Andy","family":"Bonnetto","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3131-3371","authenticated-orcid":false,"given":"St\u00e9phane","family":"d\u2019Ascoli","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3777-2202","authenticated-orcid":false,"given":"Alexander","family":"Mathis","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,10,31]]},"reference":[{"issue":"1","key":"7_CR1","doi-asserted-by":"publisher","first-page":"18","DOI":"10.1016\/j.neuron.2014.09.005","volume":"84","author":"DJ Anderson","year":"2014","unstructured":"Anderson, D.J., Perona, P.: Toward a science of computational ethology. Neuron 84(1), 18\u201331 (2014)","journal-title":"Neuron"},{"key":"7_CR2","doi-asserted-by":"crossref","unstructured":"Athanasiou, N., Petrovich, M., Black, M.J., Varol, G.: TEACH: temporal action composition for 3D humans. In: 2022 International Conference on 3D Vision (3DV), pp. 414\u2013423. IEEE (2022)","DOI":"10.1109\/3DV57658.2022.00053"},{"key":"7_CR3","doi-asserted-by":"crossref","unstructured":"Athanasiou, N., Petrovich, M., Black, M.J., Varol, G.: SINC: spatial composition of 3D human motions for simultaneous action generation. In: International Conference on Computer Vision (ICCV) (2023)","DOI":"10.1109\/ICCV51070.2023.00916"},{"key":"7_CR4","unstructured":"Azabou, M., et al.: Relax, it doesn\u2019t matter how you get there: a new self-supervised approach for multi-timescale behavior analysis. In: Advances in Neural Information Processing Systems, vol. 36 (2023)"},{"key":"7_CR5","series-title":"LNCS","doi-asserted-by":"publisher","first-page":"348","DOI":"10.1007\/978-3-031-19836-6_20","volume-title":"ECCV 2022","author":"R Bachmann","year":"2022","unstructured":"Bachmann, R., Mizrahi, D., Atanov, A., Zamir, A.: MultiMAE: multi-modal multi-task masked autoencoders. In: Avidan, S., Brostow, G., Ciss\u00e9, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13697, pp. 348\u2013367. Springer, Cham (2022). https:\/\/doi.org\/10.1007\/978-3-031-19836-6_20"},{"key":"7_CR6","unstructured":"Baevski, A., Hsu, W.N., Xu, Q., Babu, A., Gu, J., Auli, M.: Data2vec: a general framework for self-supervised learning in speech, vision and language. In: International Conference on Machine Learning, pp. 1298\u20131312. PMLR (2022)"},{"issue":"99","key":"7_CR7","doi-asserted-by":"publisher","first-page":"20140672","DOI":"10.1098\/rsif.2014.0672","volume":"11","author":"GJ Berman","year":"2014","unstructured":"Berman, G.J., Choi, D.M., Bialek, W., Shaevitz, J.W.: Mapping the stereotyped behaviour of freely moving fruit flies. J. R. Soc. Interface 11(99), 20140672 (2014)","journal-title":"J. R. Soc. Interface"},{"key":"7_CR8","unstructured":"Bernstein, N.A.: The Co-ordination and Regulation of Movements, vol. 1. Pergamon Press, Oxford, New York (1967)"},{"issue":"5","key":"7_CR9","doi-asserted-by":"publisher","first-page":"201","DOI":"10.1016\/j.tics.2008.02.009","volume":"12","author":"MM Botvinick","year":"2008","unstructured":"Botvinick, M.M.: Hierarchical models of behavior and prefrontal function. Trends Cogn. Sci. 12(5), 201\u2013208 (2008)","journal-title":"Trends Cogn. Sci."},{"key":"7_CR10","unstructured":"Brown, T., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems, vol. 33, pp. 1877\u20131901 (2020)"},{"key":"7_CR11","series-title":"LNCS","doi-asserted-by":"publisher","first-page":"185","DOI":"10.1007\/978-3-031-19809-0_11","volume-title":"ECCV 2022","author":"Y Chen","year":"2022","unstructured":"Chen, Y., et al.: Hierarchically self-supervised transformer for human skeleton representation learning. In: Avidan, S., Brostow, G., Ciss\u00e9, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13686, pp. 185\u2013202. Springer, Cham (2022). https:\/\/doi.org\/10.1007\/978-3-031-19809-0_11"},{"key":"7_CR12","doi-asserted-by":"crossref","unstructured":"Chen, Y., Zhang, Z., Yuan, C., Li, B., Deng, Y., Hu, W.: Channel-wise topology refinement graph convolution for skeleton-based action recognition. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp. 13359\u201313368 (2021)","DOI":"10.1109\/ICCV48922.2021.01311"},{"key":"7_CR13","unstructured":"Chu, X., et al.: Twins: revisiting the design of spatial attention in vision transformers. In: Advances in Neural Information Processing Systems, vol. 34, pp. 9355\u20139366 (2021)"},{"key":"7_CR14","unstructured":"Chunhui, L., Yueyu, H., Yanghao, L., Sijie, S., Jiaying, L.: PKU-MMD: a large scale benchmark for continuous multi-modal human action understanding. arXiv preprint arXiv:1703.07475 (2017)"},{"key":"7_CR15","unstructured":"Co-Reyes, J., Liu, Y., Gupta, A., Eysenbach, B., Abbeel, P., Levine, S.: Self-consistent trajectory autoencoder: hierarchical reinforcement learning with trajectory embeddings. In: International Conference on Machine Learning, pp. 1009\u20131018. PMLR (2018)"},{"key":"7_CR16","doi-asserted-by":"crossref","unstructured":"Damen, D., et\u00a0al.: Rescaling egocentric vision: collection, pipeline and challenges for EPIC-KITCHENS-100. Int. J. Comput. Vis., 1\u201323 (2022)","DOI":"10.1007\/s11263-021-01531-2"},{"issue":"1","key":"7_CR17","doi-asserted-by":"publisher","first-page":"11","DOI":"10.1016\/j.neuron.2019.09.038","volume":"104","author":"SR Datta","year":"2019","unstructured":"Datta, S.R., Anderson, D.J., Branson, K., Perona, P., Leifer, A.: Computational neuroethology: a call to action. Neuron 104(1), 11\u201324 (2019)","journal-title":"Neuron"},{"key":"7_CR18","unstructured":"Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)"},{"key":"7_CR19","unstructured":"Dosovitskiy, A., et\u00a0al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)"},{"key":"7_CR20","doi-asserted-by":"crossref","unstructured":"Dosovitskiy, A., et al.: FlowNet: learning optical flow with convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2758\u20132766 (2015)","DOI":"10.1109\/ICCV.2015.316"},{"key":"7_CR21","doi-asserted-by":"crossref","unstructured":"Duan, H., Zhao, Y., Chen, K., Lin, D., Dai, B.: Revisiting skeleton-based action recognition. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 2969\u20132978 (2022)","DOI":"10.1109\/CVPR52688.2022.00298"},{"key":"7_CR22","doi-asserted-by":"crossref","unstructured":"Fan, H., et al.: Multiscale vision transformers. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp. 6824\u20136835 (2021)","DOI":"10.1109\/ICCV48922.2021.00675"},{"key":"7_CR23","unstructured":"Feichtenhofer, C., Li, Y., He, K., et al.: Masked autoencoders as spatiotemporal learners. In: Advances in Neural Information Processing Systems, vol. 35, pp. 35946\u201335958 (2022)"},{"key":"7_CR24","doi-asserted-by":"publisher","first-page":"219","DOI":"10.1007\/s11263-013-0677-1","volume":"107","author":"A Gaidon","year":"2014","unstructured":"Gaidon, A., Harchaoui, Z., Schmid, C.: Activity representation with motion hierarchies. Int. J. Comput. Vision 107, 219\u2013238 (2014)","journal-title":"Int. J. Comput. Vision"},{"key":"7_CR25","doi-asserted-by":"crossref","unstructured":"Goodall, C.: Procrustes methods in the statistical analysis of shape. J. Roy. Stat. Soc. Ser. B (Methodol.) 53(2), 285\u2013321 (2018)","DOI":"10.1111\/j.2517-6161.1991.tb01825.x"},{"key":"7_CR26","doi-asserted-by":"crossref","unstructured":"Guo, C., et al.: Action2Motion: conditioned generation of 3D human motions. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 2021\u20132029 (2020)","DOI":"10.1145\/3394171.3413635"},{"key":"7_CR27","doi-asserted-by":"publisher","first-page":"85","DOI":"10.1016\/j.cviu.2017.01.011","volume":"158","author":"F Han","year":"2017","unstructured":"Han, F., Reily, B., Hoff, W., Zhang, H.: Space-time representation of people based on 3D skeletal data: a review. Comput. Vis. Image Underst. 158, 85\u2013105 (2017)","journal-title":"Comput. Vis. Image Underst."},{"key":"7_CR28","series-title":"LNCS","doi-asserted-by":"publisher","first-page":"59","DOI":"10.1007\/978-3-031-20047-2_4","volume-title":"ECCV 2022","author":"AW Harley","year":"2022","unstructured":"Harley, A.W., Fang, Z., Fragkiadaki, K.: Particle video revisited: tracking through occlusions using point trajectories. In: Avidan, S., Brostow, G., Ciss\u00e9, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13682, pp. 59\u201375. Springer, Cham (2022). https:\/\/doi.org\/10.1007\/978-3-031-20047-2_4"},{"key":"7_CR29","doi-asserted-by":"publisher","first-page":"11","DOI":"10.1016\/j.conb.2021.04.004","volume":"70","author":"SB Hausmann","year":"2021","unstructured":"Hausmann, S.B., Vargas, A.M., Mathis, A., Mathis, M.W.: Measuring and modeling the motor system with machine learning. Curr. Opin. Neurobiol. 70, 11\u201323 (2021)","journal-title":"Curr. Opin. Neurobiol."},{"key":"7_CR30","doi-asserted-by":"crossref","unstructured":"He, K., Chen, X., Xie, S., Li, Y., Doll\u00e1r, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000\u201316009 (2022)","DOI":"10.1109\/CVPR52688.2022.01553"},{"issue":"1","key":"7_CR31","doi-asserted-by":"publisher","first-page":"5188","DOI":"10.1038\/s41467-021-25420-x","volume":"12","author":"AI Hsu","year":"2021","unstructured":"Hsu, A.I., Yttri, E.A.: B-SOiD, an open-source unsupervised algorithm for identification and fast prediction of behaviors. Nat. Commun. 12(1), 5188 (2021)","journal-title":"Nat. Commun."},{"key":"7_CR32","doi-asserted-by":"publisher","first-page":"3451","DOI":"10.1109\/TASLP.2021.3122291","volume":"29","author":"WN Hsu","year":"2021","unstructured":"Hsu, W.N., Bolte, B., Tsai, Y.H.H., Lakhotia, K., Salakhutdinov, R., Mohamed, A.: HuBERT: self-supervised speech representation learning by masked prediction of hidden units. IEEE\/ACM Trans. Audio Speech Lang. Process. 29, 3451\u20133460 (2021)","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"7_CR33","unstructured":"Huang, L., You, S., Zheng, M., Wang, F., Qian, C., Yamasaki, T.: Green hierarchical vision transformer for masked image modeling. In: Oh, A.H., Agarwal, A., Belgrave, D., Cho, K. (eds.) Advances in Neural Information Processing Systems (2022)"},{"key":"7_CR34","unstructured":"Huang, P.Y., et al.: Masked autoencoders that listen. In: Advances in Neural Information Processing Systems, vol. 35, pp. 28708\u201328720 (2022)"},{"key":"7_CR35","doi-asserted-by":"crossref","unstructured":"Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: FlowNet 2.0: evolution of optical flow estimation with deep networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol.\u00a02, p.\u00a06 (2017)","DOI":"10.1109\/CVPR.2017.179"},{"key":"7_CR36","unstructured":"Jaegle, A., Gimeno, F., Brock, A., Vinyals, O., Zisserman, A., Carreira, J.: Perceiver: general perception with iterative attention. In: International Conference on Machine Learning, pp. 4651\u20134664. PMLR (2021)"},{"key":"7_CR37","doi-asserted-by":"publisher","first-page":"64","DOI":"10.1162\/tacl_a_00300","volume":"8","author":"M Joshi","year":"2020","unstructured":"Joshi, M., Chen, D., Liu, Y., Weld, D.S., Zettlemoyer, L., Levy, O.: SpanBERT: improving pre-training by representing and predicting spans. Trans. Assoc. Comput. Linguistics 8, 64\u201377 (2020)","journal-title":"Trans. Assoc. Comput. Linguistics"},{"key":"7_CR38","unstructured":"Kay, W., et\u00a0al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)"},{"key":"7_CR39","doi-asserted-by":"crossref","unstructured":"Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: a large video database for human motion recognition. In: International Conference on Computer Vision (2011)","DOI":"10.1109\/ICCV.2011.6126543"},{"key":"7_CR40","unstructured":"Lashley, K.S., et\u00a0al.: The Problem of Serial Order in Behavior, vol.\u00a021. Bobbs-Merrill, Oxford (1951)"},{"key":"7_CR41","unstructured":"Li, S.J., AbuFarha, Y., Liu, Y., Cheng, M.M., Gall, J.: MS-TCN++: multi-stage temporal convolutional network for action segmentation. IEEE Trans. Pattern Anal. Mach. Intell. (2020)"},{"key":"7_CR42","doi-asserted-by":"crossref","unstructured":"Li, Y., et al.: MViTv2: improved multiscale vision transformers for classification and detection. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 4804\u20134814 (2022)","DOI":"10.1109\/CVPR52688.2022.00476"},{"issue":"10","key":"7_CR43","doi-asserted-by":"publisher","first-page":"2684","DOI":"10.1109\/TPAMI.2019.2916873","volume":"42","author":"J Liu","year":"2019","unstructured":"Liu, J., Shahroudy, A., Perez, M., Wang, G., Duan, L.Y., Kot, A.C.: NTU RGB+ D 120: a large-scale benchmark for 3D human activity understanding. IEEE Trans. Pattern Anal. Mach. Intell. 42(10), 2684\u20132701 (2019)","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"7_CR44","doi-asserted-by":"crossref","unstructured":"Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp. 10012\u201310022 (2021)","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"7_CR45","doi-asserted-by":"crossref","unstructured":"Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: a skinned multi-person linear model, vol.\u00a02, pp. 851\u2013866. Association for Computing Machinery (2023)","DOI":"10.1145\/3596711.3596800"},{"key":"7_CR46","unstructured":"Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2019)"},{"key":"7_CR47","unstructured":"Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Advances in Neural Information Processing Systems, vol. 32 (2019)"},{"issue":"1","key":"7_CR48","doi-asserted-by":"publisher","first-page":"1267","DOI":"10.1038\/s42003-022-04080-7","volume":"5","author":"K Luxem","year":"2022","unstructured":"Luxem, K., et al.: Identifying behavioral structure from deep variational embeddings of animal motion. Commun. Biol. 5(1), 1267 (2022)","journal-title":"Commun. Biol."},{"key":"7_CR49","doi-asserted-by":"crossref","unstructured":"Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: AMASS: archive of motion capture as surface shapes. In: International Conference on Computer Vision, pp. 5442\u20135451 (Oct 2019)","DOI":"10.1109\/ICCV.2019.00554"},{"key":"7_CR50","doi-asserted-by":"crossref","unstructured":"Mao, Y., Deng, J., Zhou, W., Fang, Y., Ouyang, W., Li, H.: Masked motion predictors are strong 3D action representation learners. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp. 10181\u201310191 (2023)","DOI":"10.1109\/ICCV51070.2023.00934"},{"issue":"1","key":"7_CR51","doi-asserted-by":"publisher","first-page":"44","DOI":"10.1016\/j.cell.2018.04.019","volume":"174","author":"JE Markowitz","year":"2018","unstructured":"Markowitz, J.E., et al.: The striatum organizes 3D behavior via moment-to-moment action selection. Cell 174(1), 44\u201358 (2018)","journal-title":"Cell"},{"key":"7_CR52","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1016\/j.conb.2019.10.008","volume":"60","author":"MW Mathis","year":"2020","unstructured":"Mathis, M.W., Mathis, A.: Deep learning tools for the measurement of animal behavior in neuroscience. Curr. Opin. Neurobiol. 60, 1\u201311 (2020)","journal-title":"Curr. Opin. Neurobiol."},{"key":"7_CR53","doi-asserted-by":"crossref","unstructured":"Mittelstadt, B., Russell, C., Wachter, S.: Explaining explanations in AI. In: Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 279\u2013288 (2019)","DOI":"10.1145\/3287560.3287574"},{"key":"7_CR54","unstructured":"Nguyen, X.P., Joty, S., Hoi, S., Socher, R.: Tree-structured attention with hierarchical accumulation. In: International Conference on Learning Representations (2020)"},{"key":"7_CR55","doi-asserted-by":"crossref","unstructured":"Patel, P., Huang, C.H.P., Tesch, J., Hoffmann, D.T., Tripathi, S., Black, M.J.: AGORA: avatars in geography optimized for regression analysis. In: Proceedings IEEE\/CVF Conference\u00a0on Computer Vision and Pattern Recognition (CVPR), June 2021","DOI":"10.1109\/CVPR46437.2021.01326"},{"key":"7_CR56","series-title":"LNCS","doi-asserted-by":"publisher","first-page":"480","DOI":"10.1007\/978-3-031-20047-2_28","volume-title":"ECCV 2022","author":"M Petrovich","year":"2022","unstructured":"Petrovich, M., Black, M.J., Varol, G.: TEMOS: generating diverse human motions from textual descriptions. In: Avidan, S., Brostow, G., Ciss\u00e9, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13682, pp. 480\u2013497. Springer, Cham (2022). https:\/\/doi.org\/10.1007\/978-3-031-20047-2_28"},{"key":"7_CR57","doi-asserted-by":"crossref","unstructured":"Punnakkal, A.R., Chandrasekaran, A., Athanasiou, N., Quiros-Ramirez, A., Black, M.J.: BABEL: bodies, action and behavior with English labels. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 722\u2013731 (2021)","DOI":"10.1109\/CVPR46437.2021.00078"},{"key":"7_CR58","unstructured":"Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: deep learning on point sets for 3D classification and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652\u2013660 (2017)"},{"key":"7_CR59","doi-asserted-by":"crossref","unstructured":"Qi, H., Zhao, C., Salzmann, M., Mathis, A.: HOISDF: constraining 3D hand-object pose estimation with global signed distance fields. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 10392\u201310402 (2024)","DOI":"10.1109\/CVPR52733.2024.00989"},{"key":"7_CR60","unstructured":"Ryali, C., Hu, et al.: Hiera: a hierarchical vision transformer without the bells-and-whistles. In: ICML (2023)"},{"key":"7_CR61","doi-asserted-by":"crossref","unstructured":"Sener, F., et al: Assembly101: a large-scale multi-view video dataset for understanding procedural activities. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 21096\u201321106 (2022)","DOI":"10.1109\/CVPR52688.2022.02042"},{"key":"7_CR62","doi-asserted-by":"crossref","unstructured":"Shahroudy, A., Liu, J., Ng, T.T., Wang, G.: NTU RGB+ D: a large scale dataset for 3D human activity analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1010\u20131019 (2016)","DOI":"10.1109\/CVPR.2016.115"},{"issue":"10","key":"7_CR63","doi-asserted-by":"publisher","first-page":"11484","DOI":"10.1109\/TPAMI.2023.3284080","volume":"45","author":"D Singhania","year":"2023","unstructured":"Singhania, D., Rahaman, R., Yao, A.: C2F-TCN: a framework for semi- and fully-supervised temporal action segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 45(10), 11484\u201311501 (2023)","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"7_CR64","doi-asserted-by":"crossref","unstructured":"Singhania, D., Rahaman, R., Yao, A.: Iterative contrast-classify for semi-supervised temporal action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol.\u00a036, pp. 2262\u20132270 (2022)","DOI":"10.1609\/aaai.v36i2.20124"},{"key":"7_CR65","doi-asserted-by":"publisher","DOI":"10.1016\/j.jvcir.2021.103055","volume":"76","author":"L Song","year":"2021","unstructured":"Song, L., Yu, G., Yuan, J., Liu, Z.: Human pose estimation and its application to action recognition: a survey. J. Vis. Commun. Image Represent. 76, 103055 (2021)","journal-title":"J. Vis. Commun. Image Represent."},{"key":"7_CR66","unstructured":"Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)"},{"key":"7_CR67","doi-asserted-by":"crossref","unstructured":"Starke, S., Zhang, H., Komura, T., Saito, J.: Neural state machine for character-scene interactions (2019)","DOI":"10.1145\/3355089.3356505"},{"key":"7_CR68","doi-asserted-by":"crossref","unstructured":"Starke, S., Zhao, Y., Komura, T., Zaman, K.: Local motion phases for learning multi-contact character movements. Assoc. Comput. Mach. Trans. Graph. (TOG) 39(4) (2020). 54\u20131","DOI":"10.1145\/3386569.3392450"},{"key":"7_CR69","doi-asserted-by":"crossref","unstructured":"Sun, C., Myers, A., Vondrick, C., Murphy, K., Schmid, C.: VideoBERT: a joint model for video and language representation learning. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp. 7464\u20137473 (2019)","DOI":"10.1109\/ICCV.2019.00756"},{"key":"7_CR70","unstructured":"Sun, J.J., et al.: The multi-agent behavior dataset: mouse dyadic social interactions. CoRR abs\/2104.02710 (2021)"},{"key":"7_CR71","doi-asserted-by":"crossref","unstructured":"Sun, J.J., Kennedy, A., Zhan, E., Anderson, D.J., Yue, Y., Perona, P.: Task programming: learning data efficient behavior representations. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 2876\u20132885 (2021)","DOI":"10.1109\/CVPR46437.2021.00290"},{"key":"7_CR72","unstructured":"Sun, J.J., et\u00a0al.: MABe22: a multi-species multi-task benchmark for learned representations of behavior. In: International Conference on Machine Learning, pp. 32936\u201332990. PMLR (2023)"},{"issue":"4","key":"7_CR73","doi-asserted-by":"publisher","first-page":"410","DOI":"10.1111\/j.1439-0310.1963.tb01161.x","volume":"20","author":"N Tinbergen","year":"1963","unstructured":"Tinbergen, N.: On aims and methods of ethology. Z. Tierpsychol. 20(4), 410\u2013433 (1963)","journal-title":"Z. Tierpsychol."},{"key":"7_CR74","unstructured":"Tong, Z., Song, Y., Wang, J., Wang, L.: VideoMAE: masked autoencoders are data-efficient learners for self-supervised video pre-training. In: Advances in Neural Information Processing Systems (2022)"},{"issue":"1","key":"7_CR75","doi-asserted-by":"publisher","first-page":"792","DOI":"10.1038\/s41467-022-27980-y","volume":"13","author":"D Tuia","year":"2022","unstructured":"Tuia, D., et al.: Perspectives in machine learning for wildlife conservation. Nat. Commun. 13(1), 792 (2022)","journal-title":"Nat. Commun."},{"key":"7_CR76","unstructured":"Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)"},{"key":"7_CR77","doi-asserted-by":"crossref","unstructured":"Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th International Conference on Machine Learning, pp. 1096\u20131103 (2008)","DOI":"10.1145\/1390156.1390294"},{"key":"7_CR78","unstructured":"Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A., Bottou, L.: Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11(12) (2010)"},{"key":"7_CR79","doi-asserted-by":"crossref","unstructured":"Wang, H., Tang, Y., Wang, Y., Guo, J., Deng, Z.H., Han, K.: Masked image modeling with local multi-scale reconstruction. arXiv preprint arXiv:2303.05251 (2023)","DOI":"10.1109\/CVPR52729.2023.00211"},{"key":"7_CR80","doi-asserted-by":"crossref","unstructured":"Wang, L., et al.: VideoMAE V2: scaling video masked autoencoders with dual masking. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 14549\u201314560 (2023)","DOI":"10.1109\/CVPR52729.2023.01398"},{"key":"7_CR81","doi-asserted-by":"crossref","unstructured":"Wang, Q., Gao, J., Lin, W., Yuan, Y.: Learning from synthetic data for crowd counting in the wild. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8198\u20138207 (2019)","DOI":"10.1109\/CVPR.2019.00839"},{"key":"7_CR82","doi-asserted-by":"crossref","unstructured":"Wang, W., et al.: Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp. 568\u2013578 (2021)","DOI":"10.1109\/ICCV48922.2021.00061"},{"key":"7_CR83","doi-asserted-by":"crossref","unstructured":"Wei, C., Fan, H., Xie, S., Wu, C.Y., Yuille, A., Feichtenhofer, C.: Masked feature prediction for self-supervised visual pre-training. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 14668\u201314678 (2022)","DOI":"10.1109\/CVPR52688.2022.01426"},{"issue":"7","key":"7_CR84","doi-asserted-by":"publisher","first-page":"1329","DOI":"10.1038\/s41592-024-02318-2","volume":"21","author":"C Weinreb","year":"2024","unstructured":"Weinreb, C., et al.: Keypoint-MoSeq: parsing behavior by linking point tracking to pose dynamics. Nat. Methods 21(7), 1329\u20131339 (2024)","journal-title":"Nat. Methods"},{"issue":"6","key":"7_CR85","doi-asserted-by":"publisher","first-page":"1121","DOI":"10.1016\/j.neuron.2015.11.031","volume":"88","author":"AB Wiltschko","year":"2015","unstructured":"Wiltschko, A.B., et al.: Mapping sub-second structure in mouse behavior. Neuron 88(6), 1121\u20131135 (2015)","journal-title":"Neuron"},{"issue":"11","key":"7_CR86","doi-asserted-by":"publisher","first-page":"1433","DOI":"10.1038\/s41593-020-00706-3","volume":"23","author":"AB Wiltschko","year":"2020","unstructured":"Wiltschko, A.B., et al.: Revealing the structure of pharmacobehavioral space through motion sequencing. Nat. Neurosci. 23(11), 1433\u20131443 (2020)","journal-title":"Nat. Neurosci."},{"key":"7_CR87","series-title":"LNCS","doi-asserted-by":"publisher","first-page":"160","DOI":"10.1007\/978-3-031-19778-9_10","volume-title":"ECCV 2022","author":"E Wood","year":"2022","unstructured":"Wood, E., Baltru\u0161aitis, T.: 3D face reconstruction with dense landmarks. In: Avidan, S., Brostow, G., Ciss\u00e9, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13673, pp. 160\u2013177. Springer, Cham (2022). https:\/\/doi.org\/10.1007\/978-3-031-19778-9_10"},{"key":"7_CR88","doi-asserted-by":"crossref","unstructured":"Wu, W., Hua, Y., Wu, S., Chen, C., Lu, A., et\u00a0al.: SkeletonMAE: spatial-temporal masked autoencoders for self-supervised skeleton action recognition. arXiv preprint arXiv:2209.02399 (2022)","DOI":"10.1109\/ICMEW59549.2023.00045"},{"key":"7_CR89","doi-asserted-by":"crossref","unstructured":"Xie, Z., et al.: SimMIM: a simple framework for masked image modeling. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 9653\u20139663 (2022)","DOI":"10.1109\/CVPR52688.2022.00943"},{"key":"7_CR90","doi-asserted-by":"crossref","unstructured":"Yan, H., Liu, Y., Wei, Y., Li, Z., Li, G., Lin, L.: SkeletonMAE: graph-based masked autoencoder for skeleton sequence pre-training. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp. 5606\u20135618 (2023)","DOI":"10.1109\/ICCV51070.2023.00516"},{"key":"7_CR91","unstructured":"Ye, S., Lauer, J., Zhou, M., Mathis, A., Mathis, M.W.: AmadeusGPT: a natural language interface for interactive animal behavioral analysis. In: Thirty-seventh Conference on Neural Information Processing Systems (2023)"},{"key":"7_CR92","doi-asserted-by":"crossref","unstructured":"Yue, Z., et al.: TS2Vec: towards universal representation of time series. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol.\u00a036, pp. 8980\u20138987 (2022)","DOI":"10.1609\/aaai.v36i8.20881"},{"issue":"5","key":"7_CR93","doi-asserted-by":"publisher","first-page":"726","DOI":"10.1109\/TETCI.2021.3100641","volume":"5","author":"Y Zhang","year":"2021","unstructured":"Zhang, Y., Ti\u0148o, P., Leonardis, A., Tang, K.: A survey on neural network interpretability. IEEE Trans. Emerging Top. Comput. Intell. 5(5), 726\u2013742 (2021)","journal-title":"IEEE Trans. Emerging Top. Comput. Intell."},{"key":"7_CR94","doi-asserted-by":"crossref","unstructured":"Zheng, Y., Harley, A.W., Shen, B., Wetzstein, G., Guibas, L.J.: PointOdyssey: a large-scale synthetic dataset for long-term point tracking. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp. 19855\u201319865 (2023)","DOI":"10.1109\/ICCV51070.2023.01818"},{"key":"7_CR95","doi-asserted-by":"crossref","unstructured":"Zhu, W., Ma, X., Liu, Z., Liu, L., Wu, W., Wang, Y.: MotionBERT: a unified perspective on learning human motion representations. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp. 15085\u201315099 (2023)","DOI":"10.1109\/ICCV51070.2023.01385"}],"container-title":["Lecture Notes in Computer Science","Computer Vision \u2013 ECCV 2024"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/978-3-031-73039-9_7","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,10,30]],"date-time":"2024-10-30T15:31:16Z","timestamp":1730302276000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/978-3-031-73039-9_7"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,10,31]]},"ISBN":["9783031730382","9783031730399"],"references-count":95,"URL":"https:\/\/doi.org\/10.1007\/978-3-031-73039-9_7","relation":{},"ISSN":["0302-9743","1611-3349"],"issn-type":[{"value":"0302-9743","type":"print"},{"value":"1611-3349","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,10,31]]},"assertion":[{"value":"31 October 2024","order":1,"name":"first_online","label":"First Online","group":{"name":"ChapterHistory","label":"Chapter History"}},{"value":"ECCV","order":1,"name":"conference_acronym","label":"Conference Acronym","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"European Conference on Computer Vision","order":2,"name":"conference_name","label":"Conference Name","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Milan","order":3,"name":"conference_city","label":"Conference City","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Italy","order":4,"name":"conference_country","label":"Conference Country","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"2024","order":5,"name":"conference_year","label":"Conference Year","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"29 September 2024","order":7,"name":"conference_start_date","label":"Conference Start Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"4 October 2024","order":8,"name":"conference_end_date","label":"Conference End Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"18","order":9,"name":"conference_number","label":"Conference Number","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"eccv2024","order":10,"name":"conference_id","label":"Conference ID","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"https:\/\/eccv2024.ecva.net\/","order":11,"name":"conference_url","label":"Conference URL","group":{"name":"ConferenceInfo","label":"Conference Information"}}]}}