Digital signal processing (DSP) is an important technology in many research fields. However, mainstream implementations of DSP algorithms either lack flexibility and scalability or suffer from performance saturation and bottlenecks, leaving room for optimization. In this work, we propose DSP-OPU, an FPGA-based overlay processor for digital signal processing. Specifically, we design an overlay architecture suitable for various DSP algorithms, with a unique data path that combines multiple computation engines and a reconfigurable pipeline. We also apply software-hardware co-design to DSP-OPU. On the software side, a customized instruction set and a user-friendly compiler provide excellent scalability. On the hardware side, the reconfigurable data path breaks the dependency between instructions and data, improving scheduling ability and transmission efficiency. In addition, we reduce the performance saturation as the number of engines increases. Compared to the C6678 SoC, DSP-OPU achieves up to 29$\times$ speedup and 70$\times$ higher energy efficiency across different DSP algorithms. Compared to FPGA-based implementations of a single DSP algorithm, it achieves up to 4.5$\times$ speedup and 4.4$\times$ better energy efficiency.
Citation: Yueyin Bai, Song Zhang, Zhiyuan Ma, Enhao Tang, Jun Yu. DSP-OPU: An FPGA-based overlay processor for digital signal processing. Electronic Research Archive, 2025, 33(5): 2698-2718. doi: 10.3934/era.2025119