
ZHANG B B, LI H B, MA Y C, et al. Few-shot video action recognition method based on continuous frame information fusion modeling [J]. Journal of Henan Polytechnic University (Natural Science), 2025, 44(4): 11-20.

Few-shot video action recognition method based on continuous frame information fusion modeling

Zhang Bingbing, Li Haibo, Ma Yuanchen, Zhang Jianxin

College of Computer Science and Engineering, Dalian Minzu University, Dalian 116650, Liaoning, China

Abstract: Objectives To overcome the limitations of existing few-shot video action recognition methods in capturing global spatiotemporal information and modeling complex actions, a new network architecture was developed to significantly enhance the accuracy and robustness of few-shot learning in video action recognition.   Methods A network architecture integrating a continuous frame information fusion module and a multi-dimensional attention modeling module was presented. The continuous frame information fusion module was positioned at the input end of the network and was primarily responsible for capturing low-level information and transforming it into richer high-level semantic information, thereby deepening contextual understanding. The multi-dimensional attention modeling module was placed in the middle layers of the network to address the inadequacies of spatiotemporal feature modeling and to enhance the model's ability to capture spatiotemporal relationships. In addition, the entire network was designed on a 2D convolutional model, effectively reducing computational complexity.   Results Experiments on four mainstream action recognition datasets (Something-Something V2, Kinetics-100, UCF101, and HMDB51) showed that, on Something-Something V2, the accuracy for the 1-shot and 5-shot tasks reached 50.8% and 68.5%, respectively; on Kinetics-100, the 1-shot and 5-shot accuracies were 68.5% and 83.8%, respectively, a significant improvement over existing methods; on UCF101, the method achieved 81.3% for the 1-shot task and 93.8% for the 5-shot task, significantly outperforming the baseline methods under different configurations; on HMDB51, the accuracies were 56.0% for the 1-shot task and 74.4% for the 5-shot task, demonstrating good generalization.   Conclusions The experimental results confirmed the effectiveness of the proposed continuous frame information fusion modeling network for few-shot video action recognition, particularly its advantage in handling complex spatiotemporal information. The proposed solution offers an effective, efficient, and practical new approach for few-shot video action recognition.
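
Note: the abstract describes the architecture only at a high level. The sketch below is a minimal PyTorch-style illustration of the arrangement it outlines (a frame-fusion stem at the network input and a channel/spatial attention block for the middle layers of a 2D-convolutional backbone). All module names, layer choices, and channel sizes are assumptions made for illustration, not the authors' implementation.

```python
# Illustrative sketch only (PyTorch): the abstract describes a 2D-CNN backbone with a
# continuous-frame information fusion module at the input and a multi-dimensional
# attention module in the middle layers. Names, shapes, and wiring are assumptions.
import torch
import torch.nn as nn


class FrameFusionStem(nn.Module):
    """Fuses each frame with its temporal neighbours before the 2D backbone (assumed design)."""

    def __init__(self, in_ch=3, out_ch=64, window=3):
        super().__init__()
        self.window = window
        # Neighbouring frames are stacked along channels and mixed by a 2D convolution,
        # turning low-level per-frame input into a short-term, motion-aware representation.
        self.fuse = nn.Conv2d(in_ch * window, out_ch, kernel_size=3, padding=1)

    def forward(self, x):                          # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        pad = self.window // 2
        # Replicate the first/last frame so every frame has a full neighbourhood.
        xp = torch.cat([x[:, :1]] * pad + [x] + [x[:, -1:]] * pad, dim=1)
        clips = torch.stack([xp[:, i:i + self.window] for i in range(t)], dim=1)
        y = self.fuse(clips.reshape(b * t, -1, h, w))
        return y.reshape(b, t, -1, h, w)           # (B, T, out_ch, H, W)


class MultiDimAttention(nn.Module):
    """Channel- and spatial-attention gating of mid-level features (one possible reading of 'multi-dimensional')."""

    def __init__(self, ch):
        super().__init__()
        self.channel_gate = nn.Sequential(nn.Linear(ch, ch), nn.Sigmoid())
        self.spatial_gate = nn.Conv2d(ch, 1, kernel_size=7, padding=3)

    def forward(self, x):                          # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        # Channel attention from globally pooled per-frame descriptors.
        x = x * self.channel_gate(x.mean(dim=(3, 4))).view(b, t, c, 1, 1)
        # Spatial attention per frame.
        gate = torch.sigmoid(self.spatial_gate(x.reshape(b * t, c, h, w)))
        return x * gate.view(b, t, 1, h, w)
```

In the abstract's description, per-frame features from such a stem would then pass through an ordinary 2D convolutional backbone, with the attention block inserted between its middle stages; the specific backbone is not named in the abstract.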

Keywords: few-shot learning; video action recognition; spatiotemporal modeling; spatiotemporal representation learning; continuous frame information
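
The 1-shot and 5-shot figures above refer to the standard N-way K-shot episodic evaluation protocol. The sketch below shows how a single episode is typically scored with a nearest-prototype classifier; this is a common few-shot baseline assumed here only for reference, since the abstract does not specify the paper's matching mechanism.

```python
# Reference sketch of N-way K-shot episodic scoring behind the 1-shot / 5-shot figures.
# The nearest-prototype classifier is a common few-shot baseline, used only for illustration.
import torch


def score_episode(support_feats, support_labels, query_feats, query_labels):
    """support_feats: (N*K, D) video-level embeddings; query_feats: (Q, D)."""
    classes = support_labels.unique()
    # One prototype per class: the mean of its K support embeddings (K = 1 or 5 here).
    protos = torch.stack([support_feats[support_labels == c].mean(dim=0) for c in classes])
    # Assign every query video to its nearest prototype and measure accuracy.
    preds = classes[torch.cdist(query_feats, protos).argmin(dim=1)]
    return (preds == query_labels).float().mean().item()
```

Reported accuracies are averages over many randomly sampled episodes (typically 5-way, with 1 or 5 support videos per class).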

DOI: 10.16186/j.cnki.1673-9787.2024070012

Funding: National Natural Science Foundation of China (61972062); Science and Technology Development Plan Project of Jilin Province (20230201111GX); Applied Basic Research Program of Liaoning Province (2023JH2/101300191, 2023JH2/101300193); Open Project of the Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education (ADIC2023ZD003)

Received: 2024/07/02

Revised: 2024/09/19

Published: 2025/06/19

