{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,29]],"date-time":"2026-01-29T23:52:45Z","timestamp":1769730765832,"version":"3.49.0"},"reference-count":0,"publisher":"Association for the Advancement of Artificial Intelligence (AAAI)","issue":"01","license":[{"start":{"date-parts":[[2019,7,17]],"date-time":"2019-07-17T00:00:00Z","timestamp":1563321600000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/www.aaai.org"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["AAAI"],"abstract":"<jats:p>Despite the success of deep learning for static image understanding, it remains unclear which network architectures are most effective for spatial-temporal modeling in videos. In this paper, in contrast to the existing CNN+RNN or pure 3D convolution based approaches, we explore a novel spatial-temporal network (StNet) architecture for both local and global modeling in videos. Particularly, StNet stacks N successive video frames into a super-image which has 3N channels and applies 2D convolution on super-images to capture local spatial-temporal relationships. To model global spatial-temporal structure, we apply temporal convolution on the local spatial-temporal feature maps. Specifically, a novel temporal Xception block is proposed in StNet, which employs separate channel-wise and temporal-wise convolutions over the feature sequence of a video. Extensive experiments on the Kinetics dataset demonstrate that our framework outperforms several state-of-the-art approaches in action recognition and can strike a satisfying trade-off between recognition accuracy and model complexity. We further demonstrate the generalization performance of the learned video representations on the UCF101 dataset.<\/jats:p>","DOI":"10.1609\/aaai.v33i01.33018401","type":"journal-article","created":{"date-parts":[[2019,8,21]],"date-time":"2019-08-21T07:40:55Z","timestamp":1566373255000},"page":"8401-8408","source":"Crossref","is-referenced-by-count":101,"title":["StNet: Local and Global Spatial-Temporal Modeling for Action Recognition"],"prefix":"10.1609","volume":"33","author":[{"given":"Dongliang","family":"He","sequence":"first","affiliation":[]},{"given":"Zhichao","family":"Zhou","sequence":"additional","affiliation":[]},{"given":"Chuang","family":"Gan","sequence":"additional","affiliation":[]},{"given":"Fu","family":"Li","sequence":"additional","affiliation":[]},{"given":"Xiao","family":"Liu","sequence":"additional","affiliation":[]},{"given":"Yandong","family":"Li","sequence":"additional","affiliation":[]},{"given":"Limin","family":"Wang","sequence":"additional","affiliation":[]},{"given":"Shilei","family":"Wen","sequence":"additional","affiliation":[]}],"member":"9382","published-online":{"date-parts":[[2019,7,17]]},"container-title":["Proceedings of the AAAI Conference on Artificial Intelligence"],"original-title":[],"link":[{"URL":"https:\/\/ojs.aaai.org\/index.php\/AAAI\/article\/download\/4855\/4728","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/ojs.aaai.org\/index.php\/AAAI\/article\/download\/4855\/4728","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,11,7]],"date-time":"2022-11-07T06:38:38Z","timestamp":1667803118000},"score":1,"resource":{"primary":{"URL":"https:\/\/ojs.aaai.org\/index.php\/AAAI\/article\/view\/4855"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,7,17]]},"references-count":0,"journal-issue":{"issue":"01","published-online":{"date-parts":[[2019,7,23]]}},"URL":"https:\/\/doi.org\/10.1609\/aaai.v33i01.33018401","relation":{},"ISSN":["2374-3468","2159-5399"],"issn-type":[{"value":"2374-3468","type":"electronic"},{"value":"2159-5399","type":"print"}],"subject":[],"published":{"date-parts":[[2019,7,17]]}}}