Abstract:
To address the insufficient extraction of spatio-temporal information from continuous video frames and the limited use of cross-channel interaction information in 3D convolutions for human behavior recognition, a behavior recognition method based on an R(2+1)D network with multi-partition spatio-temporal information fusion and attention is proposed. Video frames are first extracted and augmented. With the R(2+1)D network as the basic framework, the Inception idea is incorporated to convolve the input frames over multiple spatio-temporal partitions and fuse the resulting features; ECA channel attention then screens the fused features for cross-channel interaction information, extracting more abstract high-level features before classification produces the behavior recognition result. The method makes full use of the video's spatio-temporal features and cross-channel interaction information, achieving an accuracy of 94.71% on the UCF101 dataset, 4.53 percentage points higher than the basic R(2+1)D network, while reducing the model parameters from 33.3M to 26.9M. Experiments show that the method effectively improves the accuracy of human behavior recognition.
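To make the underlying R(2+1)D building block concrete, the sketch below computes the parameter counts involved in factorizing a full t×d×d 3D convolution into a 1×d×d spatial convolution followed by a t×1×1 temporal one, using the standard rule for choosing the intermediate channel width so the factorized block roughly matches the 3D convolution's parameter budget. This is a generic illustration of the (2+1)D decomposition, not the authors' exact network; the function names and the example channel sizes are placeholders.

```python
def conv3d_params(n_in, n_out, t, d):
    """Parameters of a full 3D convolution with a t x d x d kernel (bias omitted)."""
    return n_in * n_out * t * d * d

def r2plus1d_params(n_in, n_out, t, d):
    """Parameters of the (2+1)D factorization: 1 x d x d spatial conv into an
    intermediate width m, then a t x 1 x 1 temporal conv. m is chosen so the
    factorized block approximately matches the 3D conv's parameter count."""
    m = (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)
    spatial = n_in * m * d * d   # 1 x d x d spatial convolution
    temporal = m * n_out * t     # t x 1 x 1 temporal convolution
    return spatial + temporal

# Example: a 3 x 3 x 3 kernel over 64 input and 64 output channels.
full = conv3d_params(64, 64, 3, 3)        # 110592 parameters
factored = r2plus1d_params(64, 64, 3, 3)  # matches the budget by construction
print(full, factored)
```

At equal parameter budget, the factorization inserts an extra nonlinearity between the spatial and temporal convolutions and makes the two easier to optimize separately, which is the motivation for using R(2+1)D as the backbone here.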