Singing voice conversion methods struggle to balance synthesis quality against singer similarity. Traditional voice conversion techniques primarily emphasize singer similarity, often at the cost of robotic-sounding singing voices. Deep learning-based singing voice conversion techniques instead focus on disentangling singer-dependent and singer-independent features. While this approach can improve the quality of synthesized singing voices, many voice conversion systems still suffer from leakage of singer-dependent features into the content embeddings. The proposed singing voice conversion technique implemented an encoder-decoder framework using a hybrid of a convolutional neural network (CNN) and long short-term memory (LSTM). This paper investigated the use of activation guidance and adaptive instance normalization for one-shot singing voice conversion. The instance normalization (IN) layers within the auto-encoder separated singer and content representations. During conversion, singer representations were transferred using adaptive instance normalization (AdaIN) layers. An activation function, applied as guidance, prevented the transfer of singer information while conveying the singing content. Additionally, combining LSTM with CNN can enhance voice conversion models by capturing both local and contextual features. The one-shot capability simplified the architecture, which used a single encoder and a single decoder. The proposed hybrid CNN-LSTM model performed well without compromising either quality or similarity: objective and subjective evaluations showed that it outperformed the baseline architectures. Evaluation results showed a mean opinion score (MOS) of 2.93 for naturalness and 3.35 for melodic similarity. The hybrid CNN-LSTM model performed high-quality voice conversion with minimal training data, making it a promising solution for various applications.
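The roles of instance normalization, activation guidance, and AdaIN described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The feature shapes, the `EPS` value, the choice of a sigmoid as the guiding activation, and all function names are assumptions made for illustration, and the CNN-LSTM encoder and decoder are omitted entirely.

```python
import numpy as np

EPS = 1e-5  # small constant for numerical stability (assumed value)


def channel_stats(feat):
    """Per-channel mean and standard deviation over the time axis.

    feat has shape (channels, frames).
    """
    mean = feat.mean(axis=1, keepdims=True)
    std = np.sqrt(feat.var(axis=1, keepdims=True) + EPS)
    return mean, std


def instance_norm(feat):
    """Remove per-channel statistics, assumed here to carry singer identity."""
    mean, std = channel_stats(feat)
    return (feat - mean) / std


def activation_guidance(content):
    """Squash the content embedding with a sigmoid (an assumed choice)."""
    return 1.0 / (1.0 + np.exp(-content))


def adain(content_feat, singer_mean, singer_std):
    """Re-impose the target singer's channel statistics on the content."""
    return instance_norm(content_feat) * singer_std + singer_mean


# Hypothetical feature maps standing in for encoder outputs.
rng = np.random.default_rng(0)
source = rng.normal(loc=2.0, scale=3.0, size=(4, 50))   # source singer
target = rng.normal(loc=-1.0, scale=0.5, size=(4, 50))  # target singer

# Content path: strip singer statistics, then apply the guiding activation.
content = activation_guidance(instance_norm(source))

# Singer path: a single target utterance supplies the statistics (one-shot).
t_mean, t_std = channel_stats(target)

# Conversion: the content now carries the target singer's statistics.
converted = adain(content, t_mean, t_std)
```

In this sketch the only singer information that reaches `converted` is the pair of per-channel statistics taken from one target utterance, which is what makes the one-shot setting possible with a single encoder and decoder.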
Citation: Assila Yousuf, David Solomon George. A hybrid CNN-LSTM model with adaptive instance normalization for one shot singing voice conversion[J]. AIMS Electronics and Electrical Engineering, 2024, 8(3): 282-300. doi: 10.3934/electreng.2024013