ZHANG B B, LI H B, MA Y C, et al. Few-shot action recognition in video method based on second-order spatiotemporal adaptation[J]. Journal of Henan Polytechnic University (Natural Science), 2025, 44(5): 43-51.
DOI:10.16186/j.cnki.1673-9787.2027070013
Received: 2024/07/02
Revised: 2024/09/19
Published: 2025/07/23
Few-shot action recognition in video method based on second-order spatiotemporal adaptation
Zhang Bingbing, Li Haibo, Ma Yuanchen, Zhang Jianxin
School of Computer Science and Engineering, Dalian Minzu University, Dalian 116650, Liaoning, China
Abstract:
Objectives In the field of few-shot video action recognition, existing methods generally struggle to process global spatiotemporal information adequately. They typically rely on large amounts of annotated data to train deep models, and with only a limited number of training samples available, they often fail to effectively capture and exploit the spatiotemporal dynamics in video data.
Methods To address this issue, an innovative second-order spatiotemporal adaptive network comprising a spatiotemporal adaptive module and a covariance aggregation module was proposed to significantly enhance the accuracy and robustness of few-shot learning in video action recognition tasks. The spatiotemporal adaptive module dynamically aggregated local and global spatiotemporal information according to changes in video content, thereby optimizing the extraction of global information. The covariance aggregation module used second-order statistics to strengthen the global spatiotemporal feature representation of videos, providing a more robust global depiction of video content.
Results Extensive experiments were conducted on four mainstream video action recognition benchmark datasets. The proposed method achieved accuracies of 52.2% and 72.4% on the 1-shot and 5-shot tasks of the Something-Something V2 dataset, significantly outperforming the baseline model. Strong performance was also observed on the Kinetics100, UCF101, and HMDB51 datasets, fully validating the method's effectiveness and practicality in few-shot video action recognition.
Conclusions The proposed second-order spatiotemporal adaptive network effectively improved the accuracy and robustness of few-shot video action recognition and demonstrated clear advantages in processing complex spatiotemporal information. This work provided an innovative and efficient solution to critical challenges in spatiotemporal modeling under limited-data scenarios.
Key words: few-shot learning; action recognition in video; spatiotemporal representation learning; temporal modeling; covariance aggregation
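The abstract describes a covariance aggregation module that uses second-order statistics to summarize global spatiotemporal features. The paper's exact formulation is not given in this front matter, so the sketch below shows only the generic idea of second-order (covariance) pooling over per-frame features; the function name `covariance_pool` and the `(T, D)` feature layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def covariance_pool(features: np.ndarray) -> np.ndarray:
    """Generic second-order (covariance) pooling over frame features.

    features: (T, D) array of T per-frame feature vectors of dimension D.
    Returns a (D, D) symmetric covariance matrix capturing pairwise
    channel interactions across the clip -- one plausible reading of a
    "covariance aggregation" step, not the paper's actual module.
    """
    mean = features.mean(axis=0, keepdims=True)            # (1, D) first-order mean
    centered = features - mean                             # remove first-order statistics
    cov = centered.T @ centered / (features.shape[0] - 1)  # (D, D) unbiased covariance
    return cov

# Usage sketch: pool 8 frames of 4-dimensional features into one (4, 4) descriptor.
frames = np.random.default_rng(0).normal(size=(8, 4))
descriptor = covariance_pool(frames)
```

The resulting matrix is symmetric and captures correlations between feature channels, which is what distinguishes such second-order descriptors from plain (first-order) average pooling.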