Research article | Special Issues

SASEGAN-TCN: Speech enhancement algorithm based on self-attention generative adversarial network and temporal convolutional network


  • Received: 29 November 2023; Revised: 15 January 2024; Accepted: 22 January 2024; Published: 21 February 2024
  • Traditional unsupervised speech enhancement models often fail to aggregate input feature information, which introduces additional noise during training and degrades the quality of the enhanced speech. To address this, this paper analyzed how the non-aggregation of input speech features affects enhancement performance, introduced a temporal convolutional network (TCN), and proposed the SASEGAN-TCN speech enhancement model, which captures local feature information and aggregates global feature information to improve performance and training stability. Simulation experiments showed that the model achieved a perceptual evaluation of speech quality (PESQ) score of 2.1636 and a short-time objective intelligibility (STOI) of 92.78% on the Valentini dataset, and 1.8077 and 83.54%, respectively, on the THCHS30 dataset. In addition, the enhanced speech was fed to an acoustic model to verify recognition accuracy: the speech recognition error rate dropped by 17.4%, a significant improvement over the baseline model. Illustrative sketches of the TCN and self-attention building blocks and of the PESQ, STOI, and error-rate evaluation are given below, after the citation.

    Citation: Rongchuang Lv, Niansheng Chen, Songlin Cheng, Guangyu Fan, Lei Rao, Xiaoyong Song, Wenjing Lv, Dingyu Yang. SASEGAN-TCN: Speech enhancement algorithm based on self-attention generative adversarial network and temporal convolutional network[J]. Mathematical Biosciences and Engineering, 2024, 21(3): 3860-3875. doi: 10.3934/mbe.2024172
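    The core design, dilated causal convolutions to capture local feature information followed by self-attention to aggregate global context, can be sketched as below. This is a minimal illustration in PyTorch under assumed layer widths, kernel sizes, and dilation rates; it is not the authors' exact SASEGAN-TCN generator, whose configuration is given in the full paper.

```python
# Illustrative sketch only: dilated causal TCN blocks followed by
# self-attention over the time axis. All sizes are assumptions for
# illustration, not the paper's SASEGAN-TCN configuration.
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1-D convolution padded on the left so outputs never see the future."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                       # x: (batch, channels, time)
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

class TCNBlock(nn.Module):
    """Residual TCN block: two dilated causal convs capture local context."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.net = nn.Sequential(
            CausalConv1d(channels, channels, kernel_size, dilation), nn.PReLU(),
            CausalConv1d(channels, channels, kernel_size, dilation), nn.PReLU(),
        )

    def forward(self, x):
        return x + self.net(x)                  # residual connection

class SelfAttention1d(nn.Module):
    """Multi-head self-attention aggregates global (long-range) context."""
    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                       # x: (batch, channels, time)
        t = x.transpose(1, 2)                   # -> (batch, time, channels)
        out, _ = self.attn(t, t, t)
        return (t + out).transpose(1, 2)        # residual, back to (B, C, T)

# Stack: local features via dilated TCN blocks, then global aggregation.
enhancer = nn.Sequential(
    nn.Conv1d(1, 64, kernel_size=1),
    TCNBlock(64, dilation=1), TCNBlock(64, dilation=2), TCNBlock(64, dilation=4),
    SelfAttention1d(64),
    nn.Conv1d(64, 1, kernel_size=1),
)
noisy = torch.randn(2, 1, 16000)                # 1 s of 16 kHz waveform
print(enhancer(noisy).shape)                    # torch.Size([2, 1, 16000])
```

    Left-only padding keeps each convolution causal, the exponentially growing dilations widen the local receptive field, and the residual self-attention layer lets every frame attend to the whole utterance, which is the global aggregation the abstract refers to.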

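    The PESQ and STOI figures quoted above are standard objective metrics. A sketch of how such scores are computed, assuming the community `pesq` and `pystoi` packages (installed with `pip install pesq pystoi soundfile`) and placeholder file names:

```python
# Sketch of the objective evaluation reported in the abstract; the wav
# file names are placeholders, and the packages used here are assumed
# third-party tools, not code from the paper.
import soundfile as sf
from pesq import pesq      # ITU-T P.862 perceptual quality, roughly -0.5..4.5
from pystoi import stoi    # short-time objective intelligibility, 0..1

clean, fs = sf.read("clean.wav")       # reference signal
enhanced, _ = sf.read("enhanced.wav")  # model output, same length and rate

# PESQ: 'wb' (wide-band) expects 16 kHz input, 'nb' (narrow-band) 8 kHz.
pesq_score = pesq(fs, clean, enhanced, "wb")

# STOI is usually reported as a percentage (e.g., 92.78% on Valentini).
stoi_score = stoi(clean, enhanced, fs, extended=False)

print(f"PESQ: {pesq_score:.4f}  STOI: {100 * stoi_score:.2f}%")
```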
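    The recognition result is reported as a relative reduction in error rate. For a word-based system this is the word error rate (WER), i.e., the edit distance between the reference and hypothesis transcripts divided by the reference length. A minimal sketch with invented transcripts:

```python
# Word error rate via edit distance, to make the "error rate reduced by
# 17.4%" claim concrete; the transcripts below are made up for illustration.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits to turn the first i ref words into the first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

baseline = wer("turn the light off", "turn delight off")    # noisy input
enhanced = wer("turn the light off", "turn the light off")  # enhanced input
print(f"relative reduction: {100 * (baseline - enhanced) / baseline:.1f}%")
```

    (THCHS30 is a Chinese corpus, where the same computation is typically applied to characters rather than words.)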


  • © 2024 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
