Traditional predictive models, often designed for simpler settings, suffer from high latency and heavy computational demands, especially in complex real-world environments. Recent progress in deep learning has advanced spatiotemporal prediction research, yet three challenges persist in general scenarios: (i) the latency and computational load of models; (ii) the dynamic nature of real-world environments; and (iii) complex motion and monitoring scenes. To overcome these challenges, we introduce a novel spatiotemporal prediction framework. It replaces high-latency recurrent models with fully convolutional ones, improving inference speed. It further addresses the dynamic nature of environments with multilevel frequency-domain encoders and decoders that facilitate spatial and temporal learning. For complex monitoring scenarios, a large-receptive-field token mixer composed of spatial-frequency attention units (SAU) and temporal attention units (TAU) preserves temporal and spatial continuity. The framework outperforms current methods in both accuracy and speed on public datasets, showing promise for practical applications beyond electricity monitoring.
Citation: Rui Han, Shuaiwei Liang, Fan Yang, Yong Yang, Chen Li. Fully convolutional video prediction network for complex scenarios[J]. Electronic Research Archive, 2024, 32(7): 4321-4339. doi: 10.3934/era.2024194
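The abstract describes the architecture only at a high level. As a rough illustration of the ideas it names, the following PyTorch sketch wires together a fully convolutional predictor with a frequency-domain mixing stage and a large-receptive-field attention unit. All module names (`FrequencyEncoder`, `LargeKernelAttention`, `FCPredictor`), kernel sizes, and the single-stage layout are our own assumptions for illustration, not the paper's released implementation.

```python
# Minimal sketch of a fully convolutional spatiotemporal predictor in the
# spirit of the abstract. Hypothetical design choices throughout; NOT the
# authors' implementation.
import torch
import torch.nn as nn


class FrequencyEncoder(nn.Module):
    """Mixes features in the frequency domain via a real 2D FFT (assumed design)."""
    def __init__(self, channels):
        super().__init__()
        # Learnable per-channel complex gain, stored as (real, imag) pairs.
        self.weight = nn.Parameter(torch.randn(channels, 2) * 0.02)

    def forward(self, x):                        # x: (B, C, H, W)
        spec = torch.fft.rfft2(x, norm="ortho")  # complex spectrum (B, C, H, W//2+1)
        w = torch.view_as_complex(self.weight)   # (C,) complex weights
        spec = spec * w.view(1, -1, 1, 1)        # per-channel spectral gating
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")


class LargeKernelAttention(nn.Module):
    """Large-receptive-field token mixer in the spirit of SAU/TAU (assumed)."""
    def __init__(self, channels, kernel_size=11, dilation=3):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        self.dw_dilated = nn.Conv2d(channels, channels, kernel_size,
                                    padding=dilation * (kernel_size - 1) // 2,
                                    dilation=dilation, groups=channels)
        self.pw = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        # Stacked depthwise + dilated depthwise convs give a large effective
        # receptive field; the result acts as a multiplicative attention map.
        attn = self.pw(self.dw_dilated(self.dw(x)))
        return x * attn


class FCPredictor(nn.Module):
    """Fully convolutional predictor: maps T input frames to T output frames
    in one forward pass, with no recurrence."""
    def __init__(self, in_frames=10, out_frames=10, hidden=64):
        super().__init__()
        self.encode = nn.Conv2d(in_frames, hidden, 3, padding=1)
        self.mixer = nn.Sequential(FrequencyEncoder(hidden),
                                   LargeKernelAttention(hidden),
                                   nn.GELU())
        self.decode = nn.Conv2d(hidden, out_frames, 3, padding=1)

    def forward(self, frames):                   # frames: (B, T_in, H, W)
        return self.decode(self.mixer(self.encode(frames)))


if __name__ == "__main__":
    model = FCPredictor()
    video = torch.randn(2, 10, 64, 64)           # batch of 10-frame grayscale clips
    print(model(video).shape)                    # torch.Size([2, 10, 64, 64])
```

Because the sketch stacks time along the channel axis and predicts all output frames in a single forward pass, it avoids the step-by-step latency of recurrent predictors, which matches the speed motivation stated in the abstract.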