The threat posed by forged-video technology has grown to affect individuals, society, and national security. Fake-video generation techniques are becoming increasingly sophisticated, and fake videos are spreading across the internet. Detection models must therefore keep pace with frequently updated forgery methods, and the large volume of data required to train them compounds this challenge. For the deepfake detection problem, we propose a cascade network that combines spatial and channel reconstruction convolution (SCConv) with a vision transformer. The front portion of the network uses SCConv together with regular convolution to extract features, which are then passed to the vision transformer to detect fake videos. We also improve the feed-forward layer of the vision transformer, which increases detection accuracy while reducing the model's computational cost. To prepare the data, we split videos into frames and extracted faces, yielding a large number of real and fake face images. Experiments on the DFDC, FaceForensics++, and Celeb-DF datasets achieved accuracies of 87.92%, 99.23%, and 99.98%, respectively. Video-level authenticity tests also produced good results, including clear visualizations. Extensive experiments further confirm the efficacy of the proposed model.
Citation: Xue Li, Huibo Zhou, Ming Zhao. Transformer-based cascade networks with spatial and channel reconstruction convolution for deepfake detection[J]. Mathematical Biosciences and Engineering, 2024, 21(3): 4142-4164. doi: 10.3934/mbe.2024183
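The preprocessing step described in the abstract (splitting videos into frames and extracting faces) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the helper names are hypothetical, and the Haar-cascade face detector stands in for whatever detector the paper actually uses.

```python
def sample_indices(total_frames: int, n_samples: int) -> list:
    """Evenly spaced frame indices to sample from a video."""
    if total_frames <= 0 or n_samples <= 0:
        return []
    step = max(total_frames // n_samples, 1)
    return list(range(0, total_frames, step))[:n_samples]


def extract_faces(video_path: str, n_samples: int = 16):
    """Read sampled frames from a video and return cropped face images.

    Requires OpenCV; the import is deferred so the pure helper above
    works without it.
    """
    import cv2

    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = []
    for idx in sample_indices(total, n_samples):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            continue
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        for (x, y, w, h) in detector.detectMultiScale(gray, 1.1, 5):
            faces.append(frame[y:y + h, x:x + w])  # crop the face region
    cap.release()
    return faces
```

Sampling frames at even intervals, rather than decoding every frame, keeps the face-image dataset balanced across the length of each video while bounding the preprocessing cost per clip.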