Abstract:
The emergence of Deepfake technology makes it easy to manipulate face videos, causing serious harm to society. Existing tamper detection methods mainly detect changes in the spatial features of local face regions between video frames; they neither consider the temporal features of continuous global regions nor detect subtle spatial feature changes within individual frames. To address this problem, we propose ViT-3DCNN, a video tamper detection method that combines a Vision Transformer with a 3DCNN. The method requires no face cropping and directly learns the continuous temporal features across video frames together with the spatial features of each frame. Experimental results show that the ViT-3DCNN model achieves 93.3% accuracy on the DFDC dataset and 90.65% on the Celeb-DF dataset without any face cropping, demonstrating clear advantages over existing detection methods in both detection accuracy and generalization.
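To make the combined design concrete, the following is a toy NumPy sketch (not the authors' implementation): each frame is split into patches and passed through a single-head ViT-style self-attention to produce a per-frame spatial embedding, and a small temporal convolution, a simplified stand-in for the 3DCNN, aggregates the embeddings across frames. All weights are random and the dimensions are illustrative assumptions.

```python
# Toy sketch of the ViT-3DCNN idea (illustrative only, random weights).
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def frame_features(frame, patch=8, dim=16):
    """ViT-style step: split an (H, W) frame into patches, embed, self-attend."""
    h, w = frame.shape
    patches = (frame.reshape(h // patch, patch, w // patch, patch)
                    .transpose(0, 2, 1, 3)
                    .reshape(-1, patch * patch))            # (N, patch*patch)
    W_e = rng.standard_normal((patch * patch, dim)) * 0.02  # patch embedding
    tokens = patches @ W_e                                  # (N, dim)
    # single-head self-attention over the patch tokens
    Wq, Wk, Wv = (rng.standard_normal((dim, dim)) * 0.02 for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(dim))
    return (attn @ v).mean(axis=0)                          # (dim,) frame embedding

def temporal_score(video, kernel=3):
    """Stand-in for the 3DCNN: a temporal conv over per-frame embeddings."""
    feats = np.stack([frame_features(f) for f in video])    # (T, dim)
    W_t = rng.standard_normal((kernel, feats.shape[1])) * 0.02
    T = feats.shape[0]
    conv = np.array([(feats[t:t + kernel] * W_t).sum()
                     for t in range(T - kernel + 1)])       # (T-kernel+1,)
    return float(1 / (1 + np.exp(-conv.mean())))            # fake probability

# Whole frames go in directly: no face cropping step.
video = rng.standard_normal((8, 32, 32))  # 8 frames of 32x32
p = temporal_score(video)
print(round(p, 3))
```

In the actual method the spatial and temporal parts are trained jointly; here the point is only the data flow, whole frames to per-frame ViT embeddings to a temporal aggregation.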