WANG X W, SHEN Y F, XING Q J, et al. Two-branch video action recognition method based on high-efficiency parameter fine-tuning [J]. Journal of Henan Polytechnic University (Natural Science), 2025, 44(4): 21-28.
doi: 10.16186/j.cnki.1673-9787.2025020018
Received: 2025/02/18
Revised: 2025/05/11
Published: 2025/06/19
Two-branch video action recognition method based on high-efficiency parameter fine-tuning
Wang Xiaowei1, Shen Yanfei2, Xing Qingjun2
1.Big Data Center, Physical Education College of Zhengzhou University, Zhengzhou 450052, Henan, China; 2.College of Sports Engineering, Beijing Sport University, Beijing 100084, China
Abstract: Objectives Video-based intelligent sports analysis has significant practical value for personalized training and customized sports analysis. Existing video action analysis frameworks rely on the “pre-training then fine-tuning” paradigm to transfer image pre-trained models to video temporal modeling. However, as model size and pre-training scale continue to grow, full-parameter updating through direct fine-tuning incurs high computational costs, and large-scale image-based architectures alone cannot effectively model the spatiotemporal features of video. Methods Therefore, a two-branch video action recognition framework named TBN (two-branch network), built on large-scale image pre-trained models, was proposed. The architecture adopted a spatiotemporally decoupled two-branch structure in which static background features and dynamic motion features were processed through distinct computational pathways. During transfer, the pre-trained weights remained frozen, and parameter-efficient transfer from the image pre-trained model to video temporal modeling was achieved by training only the small number of newly added parameters in the Prompt and Adapter components. In addition, to address the limitations of existing benchmark datasets in high-speed motion scenarios, a large-scale sports dataset named Kinetics-Sports was constructed. The dataset comprised 42 sports categories (including basketball, ice skating, hurdling, etc.), providing a more rigorous benchmark for motion analysis. Results Experiments on the Kinetics-Sports, UCF101, and HMDB51 datasets showed that the proposed method achieved recognition accuracies of 97.8%, 78.0%, and 74.2%, respectively, outperforming state-of-the-art approaches on the corresponding datasets. Furthermore, the framework required only 12 M parameters and exhibited lower computational complexity than prevailing mainstream algorithms. Conclusions The proposed model achieved a more favorable balance between accuracy and efficiency, improving both the accuracy of sports action recognition and the computational efficiency of inference. It thereby provided an efficient solution for video transfer learning with prevailing large-scale vision models.
Key words: video action recognition; pre-trained model; high-efficiency parameter fine-tuning; two-branch network; spatiotemporal modeling
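To make the parameter-efficient transfer idea described in the abstract concrete, the following is a minimal illustrative sketch, not the authors' TBN implementation: an image-pretrained backbone is frozen, and only a small set of newly added parameters (learnable prompt tokens, a lightweight adapter, and a classification head) is trained. All module names, dimensions, and the backbone used here are illustrative assumptions.

```python
# Minimal sketch (assumed, not the authors' code) of prompt + adapter tuning
# over a frozen pretrained encoder, as described in the abstract.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

class PromptedFrozenBackbone(nn.Module):
    """Freezes a pretrained encoder, prepends learnable prompt tokens,
    and applies a trainable adapter before classification."""
    def __init__(self, encoder, dim, num_prompts=8, num_classes=42):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():        # freeze pretrained weights
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, dim) * 0.02)
        self.adapter = Adapter(dim)
        self.head = nn.Linear(dim, num_classes)    # e.g. 42 sports categories

    def forward(self, tokens):                     # tokens: (B, N, dim)
        b = tokens.size(0)
        x = torch.cat([self.prompts.expand(b, -1, -1), tokens], dim=1)
        x = self.encoder(x)                        # frozen forward pass
        x = self.adapter(x).mean(dim=1)            # trainable adapter + pooling
        return self.head(x)

if __name__ == "__main__":
    dim = 256
    # Stand-in for an image-pretrained transformer backbone (assumption).
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
        num_layers=2,
    )
    model = PromptedFrozenBackbone(encoder, dim)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable params: {trainable} / {total}")  # only prompts/adapter/head train
    out = model(torch.randn(2, 16, dim))               # e.g. 16 frame/patch tokens
    print(out.shape)                                    # (2, 42)
```

In this sketch, only the prompt tokens, adapter, and head receive gradients, which is the mechanism by which the abstract's framework keeps the trainable parameter count small while adapting an image-pretrained model to video recognition.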