ZHANG B B, LI H B, MA Y C, et al. Few-shot video action recognition method based on continuous frame information fusion modeling [J]. Journal of Henan Polytechnic University (Natural Science), 2025, 44(4): 11-20.
doi: 10.16186/j.cnki.1673-9787.2024070012
Received: 2024/07/02
Revised: 2024/09/19
Published: 2025/06/19
Few-shot video action recognition method based on continuous frame information fusion modeling
Zhang Bingbing, Li Haibo, Ma Yuanchen, Zhang Jianxin
Computer Science and Engineering College, Dalian Minzu University, Dalian 116650, Liaoning, China
Abstract: Objectives To overcome the limitations of existing few-shot video action recognition methods in capturing global spatiotemporal information and modeling complex behaviors, a new network architecture was developed to significantly enhance the accuracy and robustness of few-shot learning in video action recognition tasks. Methods A network architecture integrating a continuous frame information fusion module and a multi-dimensional attention modeling module was presented. The continuous frame information fusion module was positioned at the input end of the network and was primarily responsible for capturing low-level information and transforming it into richer high-level semantic information, thereby deepening the model's understanding of context. The multi-dimensional attention modeling module was placed in the middle layers of the network to address inadequacies in spatiotemporal feature modeling and to enhance the model's capability to capture spatiotemporal relationships. Additionally, the entire network was built on a 2D convolutional model, effectively reducing computational complexity. Results Experiments on four mainstream action recognition datasets showed that, on the Something-Something V2 dataset, the accuracy rates for the 1-shot and 5-shot tasks reached 50.8% and 68.5%, respectively; on the Kinetics-100 dataset, the 1-shot and 5-shot tasks achieved accuracy rates of 68.5% and 83.8%, respectively, a significant improvement over existing methods; on the UCF101 dataset, the method achieved accuracy rates of 81.3% for the 1-shot task and 93.8% for the 5-shot task, both markedly superior to baseline methods; and on the HMDB51 dataset, the method demonstrated good generalization, with accuracy rates of 56.0% for the 1-shot task and 74.4% for the 5-shot task. Conclusions The experimental results confirmed the effectiveness of the proposed continuous frame information fusion modeling network for few-shot video action recognition, particularly in enhancing the model's ability to process complex spatiotemporal information. The approach presented in this study introduces an effective new method to the field of few-shot action recognition, demonstrating its efficiency and practicality.
Key words: few-shot learning; video action recognition; spatiotemporal modeling; spatiotemporal representation learning; continuous frame information
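This record does not include the authors' implementation, so the following PyTorch sketch illustrates only one plausible reading of the two modules named in the abstract: an input-end continuous frame fusion step and a mid-network channel/spatial attention block, built from 1D/2D convolutions to stay consistent with the abstract's 2D-convolution design. All module names, tensor shapes, and hyperparameters below are illustrative assumptions, not the paper's actual architecture.

```python
# A minimal sketch, assuming PyTorch; shapes and hyperparameters are
# illustrative assumptions, not the authors' published implementation.
import torch
import torch.nn as nn


class ContinuousFrameFusion(nn.Module):
    """Hypothetical input-end module: mixes each frame with its neighbors
    so low-level pixel information is lifted toward higher-level semantics
    before the 2D backbone processes the frames independently."""

    def __init__(self, channels: int):
        super().__init__()
        # Depthwise 1D convolution over the frame axis; kernel size 3
        # gives one frame of context on either side (an assumed design,
        # matching the phrase "continuous frame information fusion").
        self.temporal_mix = nn.Conv1d(channels, channels, kernel_size=3,
                                      padding=1, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape
        y = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        y = self.temporal_mix(y)
        y = y.reshape(b, h, w, c, t).permute(0, 4, 3, 1, 2)
        return x + y  # residual fusion preserves the original frame content


class MultiDimensionalAttention(nn.Module):
    """Hypothetical mid-network module: sequential channel and spatial
    attention applied per frame to strengthen spatiotemporal feature
    modeling while using only 2D convolutions."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                          # global context per channel
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3), # where to attend
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * frames, channels, height, width)
        x = x * self.channel_gate(x)
        x = x * self.spatial_gate(x)
        return x


if __name__ == "__main__":
    frames = torch.randn(2, 8, 64, 56, 56)  # (batch, T, C, H, W)
    fused = ContinuousFrameFusion(64)(frames)
    b, t, c, h, w = fused.shape
    attended = MultiDimensionalAttention(64)(fused.reshape(b * t, c, h, w))
    print(attended.shape)  # torch.Size([16, 64, 56, 56])
```

In this reading, the depthwise temporal convolution and the residual connection keep the added cost small, and folding the frame axis into the batch axis lets the attention block run as plain 2D convolutions, which is consistent with the abstract's claim of reduced computational complexity relative to 3D alternatives.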