Two-branch video action recognition method based on high-efficiency parameter fine-tuning - Henan Polytechnic University Publishing Center

Date: 2025-06-19

WANG X W, SHEN Y F, XING Q J, et al. Two-branch video action recognition method based on high-efficiency parameter fine-tuning[J]. Journal of Henan Polytechnic University (Natural Science), 2025, 44(4): 21-28.

Wang Xiaowei1, Shen Yanfei2, Xing Qingjun2

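1. Sports Big Data Center, Physical Education College of Zhengzhou University, Zhengzhou 450052, Henan, China; 2. College of Sports Engineering, Beijing Sport University, Beijing 100084, China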

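Abstract: Objective: Video-based AI for smart sports has practical value for personalized training and customized motion analysis. Existing video action analysis frameworks rely on the "pre-train, then fine-tune" paradigm to transfer image pre-trained models to video temporal modeling. As model size and pre-training scale keep growing, however, direct fine-tuning must update all parameters and is therefore computationally expensive, and it remains difficult to model the spatiotemporal features of video on top of large image models. Methods: To address this, a two-branch video action recognition framework based on large-scale image pre-trained models, TBN (two branch network), is proposed. It uses a spatiotemporally decoupled two-branch architecture whose branches separately process static background features and temporal dynamic action features. During transfer, the pre-trained weights stay frozen and only a small number of parameters in the added Prompt and Adaptor modules are trained, giving a parameter-efficient transfer from image pre-trained models to video temporal modeling. In addition, because existing benchmark datasets fall short in high-speed motion scenarios, a large-scale sports dataset, Kinetics-Sports, is constructed; it contains 42 sports categories (including basketball, skating, and hurdling) and provides a more rigorous test benchmark. Results: Experiments on the Kinetics-Sports, UCF101, and HMDB51 datasets show that the proposed method reaches recognition accuracies of 97.8%, 78.0%, and 74.2%, respectively, outperforming the current state-of-the-art methods on these datasets, with a parameter size of only 12 MB and lower computational complexity than existing mainstream algorithms. Conclusion: The proposed model achieves a better balance between accuracy and efficiency, improves both the accuracy and the inference efficiency of sports action detection, and provides an efficient solution for transferring large vision models to video.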
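The parameter-efficient transfer described in the abstract (frozen pre-trained weights, trainable Prompt and Adaptor modules only) can be illustrated with a rough parameter count. This is a minimal sketch under assumed sizes: the depth, width, bottleneck size, and prompt count below are hypothetical ViT-B-like values, and `Param`, `build_params`, and `count` are illustrative names. None of them come from the paper, and the resulting numbers are not meant to reproduce the parameter figure it reports.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """A named parameter group with a size and a trainable flag."""
    name: str
    size: int
    trainable: bool


def build_params(depth=12, dim=768, bottleneck=64, num_prompts=8):
    """List parameter groups for a frozen backbone plus prompt and adapter modules.

    All sizes are hypothetical (roughly ViT-B-like), not taken from the paper.
    """
    params = []
    for i in range(depth):
        # Frozen backbone block: attention (4 * dim^2) + MLP (8 * dim^2), biases ignored.
        params.append(Param(f"block{i}.backbone", 12 * dim * dim, trainable=False))
        # Bottleneck adapter: down-projection and up-projection, each with a bias.
        adapter = dim * bottleneck + bottleneck + bottleneck * dim + dim
        params.append(Param(f"block{i}.adapter", adapter, trainable=True))
    # Learnable prompt tokens prepended to the input sequence.
    params.append(Param("prompt", num_prompts * dim, trainable=True))
    return params


def count(params):
    """Return (trainable, total) parameter counts."""
    trainable = sum(p.size for p in params if p.trainable)
    total = sum(p.size for p in params)
    return trainable, total


params = build_params()
trainable, total = count(params)
print(f"trainable: {trainable:,} of {total:,} ({100 * trainable / total:.2f}%)")
```

Under these assumed sizes, the trainable modules make up well under 2% of all parameters, which is the general reason freezing the backbone keeps fine-tuning cheap.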

Keywords: video action recognition; pre-trained model; parameter-efficient fine-tuning; two-branch network; spatiotemporal modeling

doi: 10.16186/j.cnki.1673-9787.2025020018

Funding: National Natural Science Foundation of China (72071018); Henan Province Science and Technology Research Program (212102310264)

Received: 2025/02/18

Revised: 2025/05/11

Published: 2025/06/19

Two-branch video action recognition method based on high-efficiency parameter fine-tuning

Wang Xiaowei1, Shen Yanfei2, Xing Qingjun2

1. Big Data Center, Physical Education College of Zhengzhou University, Zhengzhou 450052, Henan, China; 2. College of Sports Engineering, Beijing Sport University, Beijing 100084, China

Abstract: Objectives: Video-oriented AI for intelligent sports has important practical value for personalized training and customized sports analysis. Existing video action analysis frameworks rely on the "pre-training then fine-tuning" paradigm to transfer image pre-trained models to video temporal modeling. However, as model size and pre-training scale continue to grow, direct fine-tuning requires updating all parameters, which incurs high computational cost, and effective modeling of spatiotemporal video features is difficult to achieve with large-scale image models alone. Methods: Therefore, a two-branch video action recognition framework named TBN (two branch network) was proposed, built on large-scale image pre-trained models. The architecture uses a spatiotemporally decoupled two-branch structure in which static background features and temporal dynamic motion features are processed by separate branches. During transfer, the pre-trained weights remained frozen and only the small number of parameters in the added Prompt and Adaptor modules were trained, which achieved parameter-efficient transfer from the image pre-trained model to video temporal modeling. Additionally, to address the limitations of existing benchmark datasets in high-speed motion scenarios, a large-scale sports dataset named Kinetics-Sports was constructed. The dataset comprises 42 sports categories (including basketball, ice skating, and hurdling), establishing a more rigorous test benchmark for motion analysis. Results: Experiments on the Kinetics-Sports, UCF101, and HMDB51 datasets showed that the proposed method achieved recognition accuracies of 97.8%, 78.0%, and 74.2%, respectively, outperforming state-of-the-art approaches on the corresponding datasets. Furthermore, the framework has only 12 M parameters and lower computational complexity than prevailing mainstream algorithms. Conclusions: The proposed model achieved a more favorable balance between accuracy and efficiency, improving both the accuracy and the inference efficiency of sports action detection. It thereby provides an efficient solution for transferring large-scale vision models to video.
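The idea of decoupling static appearance from temporal motion can be shown with a toy example. The abstract does not say how the TBN branches compute their features, so this sketch makes its own assumptions: the static branch is a temporal average of per-frame feature vectors, the dynamic branch is the mean absolute difference between consecutive frames, and the function names are hypothetical.

```python
def static_feature(frames):
    """Static branch (assumed): average each feature dimension over time."""
    t = len(frames)
    return [sum(col) / t for col in zip(*frames)]


def dynamic_feature(frames):
    """Dynamic branch (assumed): mean absolute difference of consecutive frames."""
    pairs = list(zip(frames[:-1], frames[1:]))
    diffs = [[abs(b - a) for a, b in zip(prev, nxt)] for prev, nxt in pairs]
    return [sum(col) / len(diffs) for col in zip(*diffs)]


def two_branch(frames):
    """Concatenate the two branch outputs into one clip-level descriptor."""
    return static_feature(frames) + dynamic_feature(frames)


# Three frames with two feature dimensions each: dim 0 is constant, dim 1 changes.
clip = [[1.0, 0.0], [1.0, 2.0], [1.0, 4.0]]
descriptor = two_branch(clip)
print(descriptor)  # [1.0, 2.0, 0.0, 2.0]
```

In the output, the constant dimension contributes only to the static half of the descriptor, while the changing dimension shows up in the dynamic half, which is the kind of separation a two-branch design aims for.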

Key words: video action recognition; pre-training model; high-efficiency parameter fine-tuning; two-branch network; space-time modeling
