Abstract:
To address the insufficient extraction of spatio-temporal information from continuous video frames and the limited use of cross-channel interaction information in 3D convolutions for human behavior recognition, a behavior recognition method based on an R(2+1)D network with multi-partition spatio-temporal information fusion and attention is proposed. Video frames are first extracted and augmented. With the R(2+1)D network as the basic framework, the Inception idea is incorporated to convolve the input frames over multiple spatio-temporal partitions and fuse the resulting features; ECA channel attention then screens the fused features for cross-channel interaction information, extracting more abstract high-level features before classification produces the behavior recognition result. The method makes full use of the video's spatio-temporal features and cross-channel interaction information, achieving an accuracy of 94.71% on the UCF101 dataset, 4.53 percentage points higher than the basic R(2+1)D network, while reducing the model parameters from 33.3M to 26.9M. Experiments show that the method effectively improves the accuracy of human behavior recognition.
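To make the underlying R(2+1)D building block concrete, the sketch below computes the parameter counts involved in factorizing a full t×d×d 3D convolution into a 1×d×d spatial convolution followed by a t×1×1 temporal one, using the standard rule for choosing the intermediate channel width so the factorized block roughly matches the 3D convolution's parameter budget. This is a generic illustration of the (2+1)D decomposition, not the authors' exact network; the function names and the example channel sizes are placeholders.

```python
def conv3d_params(n_in, n_out, t, d):
    """Parameters of a full 3D convolution with a t x d x d kernel (bias omitted)."""
    return n_in * n_out * t * d * d

def r2plus1d_params(n_in, n_out, t, d):
    """Parameters of the (2+1)D factorization: 1 x d x d spatial conv into an
    intermediate width m, then a t x 1 x 1 temporal conv. m is chosen so the
    factorized block approximately matches the 3D conv's parameter count."""
    m = (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)
    spatial = n_in * m * d * d   # 1 x d x d spatial convolution
    temporal = m * n_out * t     # t x 1 x 1 temporal convolution
    return spatial + temporal

# Example: a 3 x 3 x 3 kernel over 64 input and 64 output channels.
full = conv3d_params(64, 64, 3, 3)        # 110592 parameters
factored = r2plus1d_params(64, 64, 3, 3)  # matches the budget by construction
print(full, factored)
```

At equal parameter budget, the factorization inserts an extra nonlinearity between the spatial and temporal convolutions and makes the two easier to optimize separately, which is the motivation for using R(2+1)D as the backbone here.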