Abstract:
The emergence of Deepfake technology makes it easy to manipulate face videos, causing serious harm to society. Existing tamper detection methods mainly detect changes in the spatial features of local face regions between video frames; they neither consider the temporal features of continuous global regions nor detect subtle spatial feature changes within individual frames. To address this problem, we propose ViT-3DCNN, a video tamper detection method that combines a Vision Transformer with a 3DCNN. The method requires no face cropping and directly learns the continuous temporal features across video frames together with the spatial features of each frame. Experimental results show that the ViT-3DCNN model achieves 93.3% accuracy on the DFDC dataset and 90.65% on the Celeb-DF dataset without any face cropping, demonstrating clear advantages over existing detection methods in both detection accuracy and generalization.
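To make the combined design concrete, the following is a toy NumPy sketch (not the authors' implementation): each frame is split into patches and passed through a single-head ViT-style self-attention to produce a per-frame spatial embedding, and a small temporal convolution, a simplified stand-in for the 3DCNN, aggregates the embeddings across frames. All weights are random and the dimensions are illustrative assumptions.

```python
# Toy sketch of the ViT-3DCNN idea (illustrative only, random weights).
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def frame_features(frame, patch=8, dim=16):
    """ViT-style step: split an (H, W) frame into patches, embed, self-attend."""
    h, w = frame.shape
    patches = (frame.reshape(h // patch, patch, w // patch, patch)
                    .transpose(0, 2, 1, 3)
                    .reshape(-1, patch * patch))            # (N, patch*patch)
    W_e = rng.standard_normal((patch * patch, dim)) * 0.02  # patch embedding
    tokens = patches @ W_e                                  # (N, dim)
    # single-head self-attention over the patch tokens
    Wq, Wk, Wv = (rng.standard_normal((dim, dim)) * 0.02 for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(dim))
    return (attn @ v).mean(axis=0)                          # (dim,) frame embedding

def temporal_score(video, kernel=3):
    """Stand-in for the 3DCNN: a temporal conv over per-frame embeddings."""
    feats = np.stack([frame_features(f) for f in video])    # (T, dim)
    W_t = rng.standard_normal((kernel, feats.shape[1])) * 0.02
    T = feats.shape[0]
    conv = np.array([(feats[t:t + kernel] * W_t).sum()
                     for t in range(T - kernel + 1)])       # (T-kernel+1,)
    return float(1 / (1 + np.exp(-conv.mean())))            # fake probability

# Whole frames go in directly: no face cropping step.
video = rng.standard_normal((8, 32, 32))  # 8 frames of 32x32
p = temporal_score(video)
print(round(p, 3))
```

In the actual method the spatial and temporal parts are trained jointly; here the point is only the data flow, whole frames to per-frame ViT embeddings to a temporal aggregation.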