

ZHANG B B, LI H B, MA Y C, et al. Few-shot action recognition in video method based on second-order spatiotemporal adaptation[J]. Journal of Henan Polytechnic University (Natural Science), 2025, 44(5): 43-51.


DOI: 10.16186/j.cnki.1673-9787.2027070013

Funding: National Natural Science Foundation of China (61972062); Science and Technology Development Plan of Jilin Province (20230201111GX); Applied Basic Research Program of Liaoning Province (2023JH2/101300191, 2023JH2/101300193); Open Project of the Key Laboratory of Advanced Design and Intelligent Computing (Ministry of Education) (ADIC2023ZD003)

Received: 2024/07/02

Revised: 2024/09/19

Published: 2025/07/23

Few-shot action recognition in video method based on second-order spatiotemporal adaptation

Zhang Bingbing, Li Haibo, Ma Yuanchen, Zhang Jianxin

School of Computer Science and Engineering, Dalian Minzu University, Dalian 116650, Liaoning, China

Abstract: Objectives In few-shot video action recognition, existing methods generally struggle to process global spatiotemporal information adequately. They typically rely on large amounts of annotated data to train deep models, and with only a few training samples available they often fail to capture and exploit the spatiotemporal dynamics of video data effectively. Methods To address this issue, a second-order spatiotemporal adaptive network architecture comprising a spatiotemporal adaptive module and a covariance aggregation module was proposed to improve the accuracy and robustness of few-shot learning in video action recognition. The spatiotemporal adaptive module dynamically aggregated local and global spatiotemporal information according to changes in video content, thereby refining the extraction of global information. The covariance aggregation module used second-order statistics to strengthen the global spatiotemporal feature representation of videos, yielding a more robust global description of video content. Results Extensive experiments were conducted on four mainstream video action recognition benchmark datasets. The proposed method achieved accuracies of 52.2% and 72.4% on the 1-shot and 5-shot tasks of the Something-Something V2 dataset, significantly outperforming the baseline model. It also performed strongly on the Kinetics100, UCF101, and HMDB51 datasets, confirming its effectiveness and practicality in few-shot video action recognition. Conclusions The proposed second-order spatiotemporal adaptive network effectively improved the accuracy and robustness of few-shot video action recognition, showed clear advantages in handling complex spatiotemporal information, and provides an innovative and effective solution to spatiotemporal modeling under limited-data conditions.
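The covariance aggregation described above is a form of second-order pooling: a clip's frame features are summarized by their temporal covariance matrix rather than a simple average. The following PyTorch sketch illustrates how such a module, together with a content-gated mix of local and global temporal context, could be wired into a few-shot episode. The module names, the convolution/attention branch design, the gating, and the normalization steps are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatiotemporalAdaptive(nn.Module):
    """Content-gated mix of a local (temporal convolution) branch and a
    global (temporal self-attention) branch over per-frame features.
    The branch and gate design here are assumptions for illustration."""

    def __init__(self, channels: int):
        super().__init__()
        self.local = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(channels, num_heads=4, batch_first=True)
        self.gate = nn.Linear(channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels) -- per-frame features of a video clip.
        local = self.local(x.transpose(1, 2)).transpose(1, 2)      # local temporal context
        global_, _ = self.attn(x, x, x)                            # global temporal context
        g = torch.sigmoid(self.gate(x.mean(dim=1, keepdim=True)))  # gate from clip content
        return g * global_ + (1.0 - g) * local


class CovarianceAggregation(nn.Module):
    """Second-order pooling: summarize a clip by the covariance of its frame features."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t = x.size(1)
        centered = x - x.mean(dim=1, keepdim=True)                  # center over time
        cov = centered.transpose(1, 2) @ centered / max(t - 1, 1)   # (batch, c, c)
        # Signed square root plus L2 normalization, a common stabilizer for
        # second-order descriptors (an assumption, not necessarily the paper's choice).
        cov = torch.sign(cov) * torch.sqrt(cov.abs() + 1e-6)
        return F.normalize(cov.flatten(1), dim=1)                   # (batch, c*c)


# Toy 5-way 1-shot episode: match one query clip to class prototypes by cosine similarity.
frames, dim = 8, 64
support = torch.randn(5, frames, dim)        # one support clip per class
query = torch.randn(1, frames, dim)
adapt, pool = SpatiotemporalAdaptive(dim), CovarianceAggregation()
protos, q = pool(adapt(support)), pool(adapt(query))
print(F.cosine_similarity(q, protos).argmax().item())  # index of the predicted class
```

In a real 5-way 1-shot setup the modules would be meta-trained over many such episodes; the random tensors here merely stand in for per-frame features from a pretrained backbone.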

Key words: few-shot learning; action recognition in video; spatiotemporal representation learning; temporal modeling; covariance aggregation
