Research article

A hybrid CNN-LSTM model with adaptive instance normalization for one shot singing voice conversion

  • Received: 04 March 2024 Revised: 22 May 2024 Accepted: 23 May 2024 Published: 03 June 2024
  • Singing voice conversion methods struggle to balance synthesis quality against singer similarity. Traditional voice conversion techniques primarily emphasize singer similarity, often producing robotic-sounding singing voices. Deep learning-based singing voice conversion techniques, in contrast, focus on disentangling singer-dependent and singer-independent features. While this approach can enhance the quality of synthesized singing voices, many voice conversion systems still suffer from leakage of singer-dependent features into the content embeddings. In the proposed singing voice conversion technique, an encoder-decoder framework was implemented using a hybrid model of a convolutional neural network (CNN) and long short-term memory (LSTM). This paper investigated the use of activation guidance and adaptive instance normalization techniques for one-shot singing voice conversion. The instance normalization (IN) layers within the auto-encoder effectively separated singer and content representations. During conversion, singer representations were transferred using adaptive instance normalization (AdaIN) layers. With the help of activation guidance, the system prevented the transfer of singer information while conveying the singing content. Additionally, fusing LSTM with CNN can enhance voice conversion models by capturing both local and contextual features. The one-shot capability simplified the architecture, utilizing a single encoder and decoder. The proposed hybrid CNN-LSTM model achieved strong performance without compromising either quality or similarity. Objective and subjective evaluations showed that the proposed hybrid CNN-LSTM model outperformed the baseline architectures, with a mean opinion score (MOS) of 2.93 for naturalness and 3.35 for melodic similarity. The hybrid CNN-LSTM technique performed high-quality voice conversion with minimal training data, making it a promising solution for various applications.

    Citation: Assila Yousuf, David Solomon George. A hybrid CNN-LSTM model with adaptive instance normalization for one shot singing voice conversion[J]. AIMS Electronics and Electrical Engineering, 2024, 8(3): 292-310. doi: 10.3934/electreng.2024013




    We consider the following family of nonlinear oscillators

    $y_{zz}+k(y)y_{z}^{3}+h(y)y_{z}^{2}+f(y)y_{z}+g(y)=0,$ (1.1)

    where $k$, $h$, $f\neq0$ and $g\neq0$ are arbitrary sufficiently smooth functions. Particular members of (1.1) are used for the description of various processes in physics, mechanics and so on, and they also appear as invariant reductions of nonlinear partial differential equations [1,2,3].

    Integrability of (1.1) was studied in a number of works [4,5,6,7,8,9,10,11,12,13,14,15,16]. In particular, in [15] linearization of (1.1) via the following generalized nonlocal transformations

    $w=F(y),\qquad d\zeta=\left(G_{1}(y)y_{z}+G_{2}(y)\right)dz$ (1.2)

    was considered. However, equivalence problems with respect to transformations (1.2) for (1.1) and its integrable nonlinear subcases have not been studied previously. Therefore, in this work we deal with the equivalence problem for (1.1) and its integrable subcase from the Painlevé-Gambier classification. Namely, we construct an equivalence criterion for (1.1) and a non-canonical form of the Ince Ⅶ equation [17,18]. As a result, we obtain two new integrable subfamilies of (1.1). What is more, we demonstrate that for any equation from (1.1) that satisfies one of these equivalence criteria one can construct an autonomous first integral in the parametric form. Notice that we use the Ince Ⅶ equation because it is one of the simplest integrable members of (1.1) with known general solution and known classification of invariant curves.

    Moreover, we show that transformations (1.2) preserve autonomous invariant curves for equations from (1.1). Since the considered non-canonical form of the Ince Ⅶ equation admits two irreducible polynomial invariant curves, we obtain that any equation from (1.1), which is equivalent to it, also admits two invariant curves. These invariant curves can be used for constructing an integrating factor for equations from (1.1) that are equivalent to the Ince Ⅶ equation. If this integrating factor is a Darboux integrating factor, then the corresponding equation is Liouvillian integrable [19]. This demonstrates the connection between the nonlocal equivalence approach and the Darboux integrability theory and its generalizations, which has recently been discussed for a less general class of nonlocal transformations in [20,21,22].

    The rest of this work is organized as follows. In the next Section we present an equivalence criterion for (1.1) and a non-canonical form of the Ince Ⅶ equation. In addition, we show how to construct an autonomous first integral for an equation from (1.1) satisfying this equivalence criterion. We also demonstrate that transformations (1.2) preserve autonomous invariant curves for (1.1). In Section 3 we provide two examples of integrable equations from (1.1) and construct their parametric first integrals, invariant curves and integrating factors. In the last Section we briefly discuss and summarize our results.

    We begin with the equivalence criterion between (1.1) and a non-canonical form of the Ince Ⅶ equation, that is [17,18]

    $w_{\zeta\zeta}+3w_{\zeta}+\epsilon w^{3}+2w=0.$ (2.1)

    Here $\epsilon\neq0$ is an arbitrary parameter, which, without loss of generality, can be set equal to $\pm1$.

    The general solution of (2.1) is

    $w=e^{-(\zeta-\zeta_{0})}\operatorname{cn}\left\{\sqrt{\epsilon}\left(e^{-(\zeta-\zeta_{0})}-C_{1}\right),\ \frac{1}{\sqrt{2}}\right\}.$ (2.2)

    Here ζ0 and C1 are arbitrary constants and cn is the Jacobian elliptic cosine. Expression (2.2) will be used below for constructing autonomous parametric first integrals for members of (1.1).
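    As a quick sanity check, (2.2) can be verified numerically. The following sketch (assuming Python with mpmath, whose ellipfun provides the Jacobi cn function; the test values $\epsilon=1$, $\zeta_0=0$ and $C_1=0.3$ are arbitrary choices of ours, not from the text) substitutes (2.2) into (2.1) and prints the residual:

```python
# Numeric check that (2.2) solves (2.1); eps, zeta0, C1 below are
# arbitrary test values, not taken from the paper.
from mpmath import mp, ellipfun, diff, exp, sqrt

mp.dps = 30                      # working precision (decimal digits)
eps, zeta0, C1 = 1, 0, mp.mpf("0.3")

def w(zeta):
    # w = exp(-(zeta - zeta0)) * cn(sqrt(eps)*(exp(-(zeta - zeta0)) - C1), 1/sqrt(2))
    t = exp(-(zeta - zeta0))
    return t * ellipfun('cn', sqrt(eps)*(t - C1), m=mp.mpf(1)/2)  # modulus 1/sqrt(2) <=> m = 1/2

for z in (0, 0.5, 1.2):
    residual = diff(w, z, 2) + 3*diff(w, z) + eps*w(z)**3 + 2*w(z)
    print(z, residual)           # residuals vanish to working precision
```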

    The equivalence criterion between (1.1) and (2.1) can be formulated as follows:

    Theorem 2.1. Equation (1.1) is equivalent to (2.1) if and only if either

    $\text{(I)}\quad 25515\,lgp^{2}q_{y}+2352980\,l^{10}+\left(3430\,q-6667920\,p^{3}\right)l^{5}-14580\,qp^{3}-10\,q^{2}-76545\,lgqpp_{y}=0,$ (2.3)

    or

    $\text{(II)}\quad 343\,l^{5}-972\,p^{3}=0,$ (2.4)

    holds. Here

    $l=9\left(fg_{y}-gf_{y}+fgh-3kg^{2}\right)-2f^{3},\quad p=gl_{y}-3lg_{y}+l\left(f^{2}-3gh\right),\quad q=25515\,g_{y}lp^{2}-5103\,lgpp_{y}+686\,l^{5}-8505\,p^{2}\left(f^{2}-3gh\right)l+6561\,p^{3}.$ (2.5)

    The expression for G2 in each case is either

    $\text{(I)}\quad G_{2}=\frac{126\,l^{2}qp^{2}}{470596\,l^{10}-\left(1333584\,p^{3}+1372\,q\right)l^{5}+q^{2}},$ (2.6)

    or

    $\text{(II)}\quad G_{2}^{2}=\frac{49\,l^{3}G_{2}+9\,p^{2}}{189\,pl}.$ (2.7)

    In all cases the functions F and G1 are given by

    $F^{2}=\frac{l}{81\,\epsilon G_{2}^{3}},\qquad G_{1}=\frac{G_{2}\left(f-3G_{2}\right)}{3g}.$ (2.8)

    Proof. We begin with the necessary conditions. Substituting (1.2) into (2.1) we get

    $y_{zz}+k(y)y_{z}^{3}+h(y)y_{z}^{2}+f(y)y_{z}+g(y)=0,$ (2.9)

    where

    $k=\frac{FG_{1}^{3}\left(\epsilon F^{2}+2\right)+3G_{1}^{2}F_{y}+G_{1}F_{yy}-F_{y}G_{1,y}}{G_{2}F_{y}},\quad h=\frac{G_{2}F_{yy}+\left(6G_{1}G_{2}-G_{2,y}\right)F_{y}+3FG_{2}G_{1}^{2}\left(\epsilon F^{2}+2\right)}{G_{2}F_{y}},\quad f=\frac{3G_{2}\left(F_{y}+FG_{1}\left(\epsilon F^{2}+2\right)\right)}{F_{y}},\quad g=\frac{FG_{2}^{2}\left(\epsilon F^{2}+2\right)}{F_{y}}.$ (2.10)

    As a consequence, we obtain that an equation of the form (1.1) can be transformed into (2.1) if its coefficients can be represented in the form (2.10).

    Conversely, if the functions $F$, $G_{1}$ and $G_{2}$ satisfy (2.10) for some values of $k$, $h$, $f$ and $g$, then (1.1) can be mapped into (2.1) via (1.2). Thus, we see that the compatibility conditions for (2.10), considered as an overdetermined system of equations for $F$, $G_{1}$ and $G_{2}$, result in the necessary and sufficient conditions for (1.1) to be equivalent to (2.1) via (1.2).
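    The substitution leading to (2.9) and (2.10) is mechanical and can be reproduced with a computer algebra system. A minimal sympy sketch (our own formulation; the operator Dz encodes the total $z$-derivative) checks that, with $k$, $h$, $f$ and $g$ defined by (2.10), equation (2.9) multiplied by $F_{y}G_{2}/(G_{1}y_{z}+G_{2})^{3}$ coincides identically with the left-hand side of (2.1):

```python
# Symbolic verification that coefficients (2.10) turn (2.1) into (2.9).
import sympy as sp

y, yz, yzz, eps = sp.symbols('y y_z y_zz epsilon')
F, G1, G2 = (sp.Function(n)(y) for n in ('F', 'G1', 'G2'))

def Dz(expr):
    """Total z-derivative of expr(y, y_z), using y' = y_z and y_z' = y_zz."""
    return sp.diff(expr, y)*yz + sp.diff(expr, yz)*yzz

dzeta = G1*yz + G2            # d(zeta)/dz from (1.2)
w_z = Dz(F)/dzeta             # w_zeta
w_zz = Dz(w_z)/dzeta          # w_zetazeta

Fy, Fyy = sp.diff(F, y), sp.diff(F, y, 2)
k = (F*G1**3*(eps*F**2 + 2) + 3*G1**2*Fy + G1*Fyy - Fy*sp.diff(G1, y))/(G2*Fy)
h = (G2*Fyy + (6*G1*G2 - sp.diff(G2, y))*Fy + 3*F*G2*G1**2*(eps*F**2 + 2))/(G2*Fy)
f = 3*G2*(Fy + F*G1*(eps*F**2 + 2))/Fy
g = F*G2**2*(eps*F**2 + 2)/Fy

lhs = (yzz + k*yz**3 + h*yz**2 + f*yz + g)*Fy*G2/dzeta**3
rhs = w_zz + 3*w_z + eps*F**3 + 2*F
print(sp.simplify(lhs - rhs))  # expect 0
```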

    To obtain the compatibility conditions, we simplify system (2.10) as follows. Using the last two equations from (2.10) we find the expression for G1 given in (2.8). Then, with the help of this relation, from (2.10) we find that

    $81\,\epsilon F^{2}G_{2}^{3}-l=0,$ (2.11)

    and

    $567\,lG_{2}^{3}+\left(243\,lgh-81\,lf^{2}-81\,gl_{y}+243\,lg_{y}\right)G_{2}-7\,l^{2}=0,\qquad 243\,lgG_{2,y}+324\,lG_{2}^{3}-81\,gl_{y}G_{2}+2\,l^{2}=0.$ (2.12)

    Here l is given by (2.5).

    As a result, we need to find compatibility conditions only for (2.12). In order to find the generic case of these compatibility conditions, we differentiate the first equation twice and find the expression for $G_{2}^{2}$ and condition (2.3). Differentiating the first equation from (2.12) a third time, we obtain (2.6). Further differentiation does not lead to any new compatibility conditions. Particular case (2.4) can be treated in a similar way.

    Finally, we remark that the cases of l=0, p=0 and q=0 result in the degeneration of transformations (1.2). This completes the proof.

    As an immediate corollary of Theorem 2.1 we get

    Corollary 2.1. If the coefficients of an equation from (1.1) satisfy either (2.3) or (2.4), then an autonomous first integral of this equation can be presented in the parametric form as follows:

    $y=F^{-1}(w),\qquad y_{z}=\frac{G_{2}w_{\zeta}}{F_{y}-G_{1}w_{\zeta}}.$ (2.13)

    Here w is the general solution of (2.1) given by (2.2). Notice also that, formally, (2.13) contains two arbitrary constants, namely ζ0 and C1. However, without loss of generality, one of them can be set equal to zero.

    Now we demonstrate that transformations (1.2) preserve autonomous invariant curves for equations from (1.1).

    First, we need to introduce the definition of an invariant curve for (1.1). We recall that Eq (1.1) can be transformed into an equivalent dynamical system

    $y_{z}=P,\quad u_{z}=Q,\qquad P=u,\quad Q=-ku^{3}-hu^{2}-fu-g.$ (2.14)

    A smooth function H(y,u) is called an invariant curve of (2.14) (or, equivalently, of (1.1)), if it is a nontrivial solution of [19]

    $PH_{y}+QH_{u}=\lambda H,$ (2.15)

    for some value of the function λ, which is called the cofactor of H.
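    Definition (2.15) is easy to check mechanically. The following sympy sketch (a helper of our own naming, not from the paper) implements it, and, as a toy illustration, checks that for the harmonic oscillator $y_{zz}+y=0$ the first integral $H=y^{2}+u^{2}$ is an invariant curve with zero cofactor:

```python
# Implementation of definition (2.15): H is an invariant curve of the
# system y_z = P, u_z = Q with cofactor lam iff P*H_y + Q*H_u = lam*H.
import sympy as sp

y, u = sp.symbols('y u')

def is_invariant_curve(H, lam, P, Q):
    return sp.simplify(P*sp.diff(H, y) + Q*sp.diff(H, u) - lam*H) == 0

# Toy usage: harmonic oscillator y_zz + y = 0, i.e., P = u, Q = -y;
# the first integral H = y**2 + u**2 has zero cofactor.
print(is_invariant_curve(y**2 + u**2, 0, u, -y))  # True
```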

    Second, we need to introduce the equation that is equivalent to (1.1) via (1.2). Substituting (1.2) into (1.1) we get

    $w_{\zeta\zeta}+\tilde{k}w_{\zeta}^{3}+\tilde{h}w_{\zeta}^{2}+\tilde{f}w_{\zeta}+\tilde{g}=0,$ (2.16)

    where

    $\tilde{k}=\frac{kG_{2}^{3}-gG_{1}^{3}+\left(G_{1,y}-hG_{1}\right)G_{2}^{2}+\left(fG_{1}-G_{2,y}\right)G_{1}G_{2}}{F_{y}^{2}G_{2}^{2}},\quad \tilde{h}=\frac{\left(hF_{y}-F_{yy}\right)G_{2}^{2}-\left(2fG_{1}-G_{2,y}\right)G_{2}F_{y}+3gG_{1}^{2}F_{y}}{F_{y}^{2}G_{2}^{2}},\quad \tilde{f}=\frac{fG_{2}-3gG_{1}}{G_{2}^{2}},\quad \tilde{g}=\frac{gF_{y}}{G_{2}^{2}}.$ (2.17)

    An invariant curve for (2.16) can be defined in the same way as that for (1.1). Notice that, further, we will denote wζ as v.

    Theorem 2.2. Suppose that either (1.1) possesses an invariant curve $H(y,u)$ with the cofactor $\lambda(y,u)$ or (2.16) possesses an invariant curve $\tilde{H}(w,v)$ with the cofactor $\tilde{\lambda}(w,v)$. Then the other equation also has an invariant curve, and the corresponding invariant curves and cofactors are connected via

    $H(y,u)=\tilde{H}\left(F,\ \frac{F_{y}u}{G_{1}u+G_{2}}\right),\qquad \lambda(y,u)=\left(G_{1}u+G_{2}\right)\tilde{\lambda}\left(F,\ \frac{F_{y}u}{G_{1}u+G_{2}}\right).$ (2.18)

    Proof. Suppose that ˜H(w,v) is an invariant curve for (2.16) with the cofactor ˜λ(w,v). Then it satisfies

    $v\tilde{H}_{w}+\left(-\tilde{k}v^{3}-\tilde{h}v^{2}-\tilde{f}v-\tilde{g}\right)\tilde{H}_{v}=\tilde{\lambda}\tilde{H}.$ (2.19)

    Substituting (1.2) into (2.19) we get

    $uH_{y}+\left(-ku^{3}-hu^{2}-fu-g\right)H_{u}=\left(G_{1}u+G_{2}\right)\tilde{\lambda}\left(F,\ \frac{F_{y}u}{G_{1}u+G_{2}}\right)H.$ (2.20)

    This completes the proof.

    As an immediate consequence of Theorem 2.2 we have that transformations (1.2) preserve autonomous first integrals admitted by members of (1.1), since they are invariant curves with zero cofactors.

    Another corollary of Theorem 2.2 is that any equation from (1.1) that is connected to (2.1) admits two invariant curves that correspond to the irreducible polynomial invariant curves of (2.1). These invariant curves of (2.1) and the corresponding cofactors are the following (see [23], formulas (3.18) and (3.19), taking into account scaling transformations):

    $\tilde{H}=\pm i\sqrt{\frac{2}{\epsilon}}\,(v+w)+w^{2},\qquad \tilde{\lambda}=\mp i\sqrt{2\epsilon}\,w-2.$ (2.21)
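    These curves can be checked directly against definition (2.15). The following sympy sketch (a verification of ours, with the system $w_{\zeta}=v$, $v_{\zeta}=-3v-\epsilon w^{3}-2w$ read off from (2.1)) confirms both sign choices:

```python
# Direct check that the curves (2.21) satisfy (2.15) for Eq (2.1).
import sympy as sp

w, v = sp.symbols('w v')
eps = sp.Symbol('epsilon', positive=True)
P, Q = v, -3*v - eps*w**3 - 2*w

for sign in (1, -1):
    H = sign*sp.I*sp.sqrt(2/eps)*(v + w) + w**2
    lam = -sign*sp.I*sp.sqrt(2*eps)*w - 2
    print(sp.simplify(P*sp.diff(H, w) + Q*sp.diff(H, v) - lam*H))  # expect 0
```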

    Therefore, we have that the following statement holds:

    Corollary 2.2. If the coefficients of an equation from (1.1) satisfy either (2.3) or (2.4), then it admits the following invariant curves with the corresponding cofactors:

    $H=\pm i\sqrt{\frac{2}{\epsilon}}\left(\frac{F_{y}u}{G_{1}u+G_{2}}+F\right)+F^{2},\qquad \lambda=\left(G_{1}u+G_{2}\right)\left(\mp i\sqrt{2\epsilon}\,F-2\right).$ (2.22)

    Let us remark that connections between (2.1) and non-autonomous variants of (1.1) can be considered via a non-autonomous generalization of transformations (1.2). However, one of two nonlocally related equations should be autonomous since otherwise nonlocal transformations do not map a differential equation into a differential equation [5].

    In this Section we have obtained the equivalence criterion between (1.1) and (2.1), which defines two new completely integrable subfamilies of (1.1). We have also demonstrated that members of these subfamilies possess an autonomous parametric first integral and two autonomous invariant curves.

    In this Section we provide two examples of integrable equations from (1.1) satisfying integrability conditions from Theorem 2.1.

    Example 1. One can show that the coefficients of the following cubic oscillator

    $y_{zz}-\frac{12\,\epsilon\mu y}{\left(\epsilon\mu^{2}y^{4}+2\right)^{2}}\,y_{z}^{3}-6\mu y\,y_{z}+2\mu^{2}y^{3}\left(\epsilon\mu^{2}y^{4}+2\right)=0,$ (3.1)

    satisfy condition (2.3) from Theorem 2.1. Consequently, Eq (3.1) is completely integrable and its general solution can be obtained from (2.2) by inverting transformations (1.2). However, it is more convenient to use Corollary 2.1 and present the autonomous first integral of (3.1) in the parametric form as follows:

    $y=\pm\sqrt{\frac{w}{\mu}},\qquad y_{z}=\frac{w\left(\epsilon w^{2}+2\right)w_{\zeta}}{2w_{\zeta}+w\left(\epsilon w^{2}+2\right)},$ (3.2)

    where w is given by (2.2), ζ is considered as a parameter and ζ0, without loss of generality, can be set equal to zero. As a result, we see that (3.1) is integrable since it has an autonomous first integral.
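    One way to see where (3.1) comes from is to exhibit the transformation (1.2) explicitly. The maps below ($F=\mu y^{2}$, $G_{1}=-4/(y(\epsilon\mu^{2}y^{4}+2))$, $G_{2}=2\mu y$) are our own reconstruction, not stated in the text; the sympy sketch checks that substituting them into (2.10) reproduces exactly the coefficients of (3.1):

```python
# Check that an explicit choice of F, G1, G2 (our inference) generates
# the coefficients of Eq (3.1) through formulas (2.10).
import sympy as sp

y, eps, mu = sp.symbols('y epsilon mu')
D = eps*mu**2*y**4 + 2
F, G1, G2 = mu*y**2, -4/(y*D), 2*mu*y
Fy, Fyy = sp.diff(F, y), sp.diff(F, y, 2)

k = (F*G1**3*(eps*F**2 + 2) + 3*G1**2*Fy + G1*Fyy - Fy*sp.diff(G1, y))/(G2*Fy)
h = (G2*Fyy + (6*G1*G2 - sp.diff(G2, y))*Fy + 3*F*G2*G1**2*(eps*F**2 + 2))/(G2*Fy)
f = 3*G2*(Fy + F*G1*(eps*F**2 + 2))/Fy
g = F*G2**2*(eps*F**2 + 2)/Fy

print(sp.simplify(k + 12*eps*mu*y/D**2))   # 0: coefficient of y_z**3 in (3.1)
print(sp.simplify(h))                      # 0: no y_z**2 term in (3.1)
print(sp.simplify(f + 6*mu*y))             # 0: coefficient of y_z in (3.1)
print(sp.simplify(g - 2*mu**2*y**3*D))     # 0: free term in (3.1)
```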

    Moreover, using Corollary 2.2 one can find invariant curves admitted by (3.1)

    $H_{1,2}=\frac{\mu^{2}y^{4}\left[\left(\mu y^{2}\pm i\sqrt{2/\epsilon}\right)\left(\epsilon\mu^{2}y^{4}+2\right)+\left(\pm i\sqrt{2\epsilon}\,\mu y^{2}-2\right)u\right]}{\mu y^{2}\left(\epsilon\mu^{2}y^{4}+2\right)-2u},\qquad \lambda_{1,2}=\frac{2\left(\mu y^{2}\left(\epsilon\mu^{2}y^{4}+2\right)-2u\right)\left(\mp i\sqrt{2\epsilon}\,\mu y^{2}-2\right)}{y\left(\epsilon\mu^{2}y^{4}+2\right)}.$ (3.3)

    With the help of the standard technique of the Darboux integrability theory [19], it is easy to find the corresponding Darboux integrating factor of (3.1)

    $M=\left(\epsilon\mu^{2}y^{4}+2\right)^{9/4}\left(2\epsilon u^{2}+\left(\epsilon\mu^{2}y^{4}+2\right)^{2}\right)^{-3/4}\left(\mu y^{2}\left(\epsilon\mu^{2}y^{4}+2\right)-2u\right)^{-3/2}.$ (3.4)

    Consequently, Eq (3.1) is Liouvillian integrable.
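    The integrating factor property can also be confirmed symbolically: for system (2.14) built from (3.1), the divergence of $(MP,\,MQ)$ must vanish. A sympy sketch of this check (our verification, not from the paper):

```python
# Check that (3.4) is an integrating factor of system (2.14) for Eq (3.1).
import sympy as sp

y, u, eps, mu = sp.symbols('y u epsilon mu')
D = eps*mu**2*y**4 + 2

k, f, g = -12*eps*mu*y/D**2, -6*mu*y, 2*mu**2*y**3*D
P, Q = u, -k*u**3 - f*u - g   # h = 0 in (3.1)

M = D**sp.Rational(9, 4) \
    * (2*eps*u**2 + D**2)**sp.Rational(-3, 4) \
    * (mu*y**2*D - 2*u)**sp.Rational(-3, 2)

print(sp.simplify(sp.diff(M*P, y) + sp.diff(M*Q, u)))  # expect 0
```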

    Example 2. Consider the Liénard (1, 9) equation

    $y_{zz}+\left(b_{i}y^{i}\right)y_{z}+a_{j}y^{j}=0,\qquad i=0,\ldots,4,\quad j=0,\ldots,9.$ (3.5)

    Here summation over repeated indices is assumed. One can show that this equation is equivalent to (2.1) if it is of the form

    $y_{zz}-9(y+\mu)(y+3\mu)^{3}y_{z}+2y(2y+3\mu)(y+3\mu)^{7}=0,$ (3.6)

    where μ is an arbitrary constant.

    With the help of Corollary 2.1 one can present the first integral of (3.6) in the parametric form as follows:

    $y=\frac{3\sqrt{-2\epsilon}\,\mu w}{2-\sqrt{-2\epsilon}\,w},\qquad y_{z}=\frac{7776\sqrt{-2\epsilon}\,\mu^{5}w\,w_{\zeta}}{\left(2-\sqrt{-2\epsilon}\,w\right)^{5}\left(2w_{\zeta}+w\left(2-\sqrt{-2\epsilon}\,w\right)\right)},$ (3.7)

    where $w$ is given by (2.2). Thus, one can see that (3.6) is completely integrable due to the existence of this parametric autonomous first integral.

    Using Corollary 2.2 we find two invariant curves of (3.6):

    $H_{1}=\frac{y^{2}\left[(2y+3\mu)(y+3\mu)^{4}-2u\right]}{(y+3\mu)^{2}\left[y(y+3\mu)^{4}-u\right]},\qquad \lambda_{1}=\frac{6\mu\left(u-y(y+3\mu)^{4}\right)}{y(y+3\mu)},$ (3.8)

    and

    $H_{2}=\frac{y^{2}(y+3\mu)^{2}}{y(y+3\mu)^{4}-u},\qquad \lambda_{2}=\frac{2(2y+3\mu)\left(u-y(y+3\mu)^{4}\right)}{y(y+3\mu)}.$ (3.9)

    The corresponding Darboux integrating factor is

    $M=\left[y(y+3\mu)^{4}-u\right]^{-3/2}\left[(2y+3\mu)(y+3\mu)^{4}-2u\right]^{-3/4}.$ (3.10)

    As a consequence, we see that Eq (3.6) is Liouvillian integrable.
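    All three objects of Example 2 can be verified in one pass. The sympy sketch below (our verification, not from the paper) checks condition (2.15) for the curves (3.8) and (3.9) and the divergence condition for the integrating factor (3.10):

```python
# Verification of (3.8), (3.9) and (3.10) for Eq (3.6).
import sympy as sp

y, u, mu = sp.symbols('y u mu')
P = u
Q = 9*(y + mu)*(y + 3*mu)**3*u - 2*y*(2*y + 3*mu)*(y + 3*mu)**7  # from (2.14)

curves = [
    (y**2*((2*y + 3*mu)*(y + 3*mu)**4 - 2*u)
     / ((y + 3*mu)**2*(y*(y + 3*mu)**4 - u)),
     6*mu*(u - y*(y + 3*mu)**4)/(y*(y + 3*mu))),            # (3.8)
    (y**2*(y + 3*mu)**2/(y*(y + 3*mu)**4 - u),
     2*(2*y + 3*mu)*(u - y*(y + 3*mu)**4)/(y*(y + 3*mu))),  # (3.9)
]
for H, lam in curves:
    print(sp.simplify(P*sp.diff(H, y) + Q*sp.diff(H, u) - lam*H))  # expect 0

M = (y*(y + 3*mu)**4 - u)**sp.Rational(-3, 2) \
    * ((2*y + 3*mu)*(y + 3*mu)**4 - 2*u)**sp.Rational(-3, 4)      # (3.10)
print(sp.simplify(sp.diff(M*P, y) + sp.diff(M*Q, u)))             # expect 0
```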

    Therefore, we see that the equations considered in Examples 1 and 2 are completely integrable from two points of view. First, they possess autonomous parametric first integrals. Second, they have Darboux integrating factors.

    In this work we have considered the equivalence problem between the family of equations (1.1) and its integrable member (2.1), with equivalence transformations given by the generalized nonlocal transformations (1.2). We have constructed the corresponding equivalence criterion in explicit form, which leads to two new integrable subfamilies of (1.1). We have demonstrated that one can explicitly construct a parametric autonomous first integral for each equation that is equivalent to (2.1) via (1.2). We have also shown that transformations (1.2) preserve autonomous invariant curves for (1.1). As a consequence, we have obtained that equations from the obtained integrable subfamilies possess two autonomous invariant curves, which correspond to the irreducible polynomial invariant curves of (2.1). This fact demonstrates a connection between the nonlocal equivalence approach and the Darboux and Liouvillian integrability approaches. We have illustrated our results with two examples of integrable equations from (1.1).

    The author was partially supported by Russian Science Foundation grant 19-71-10003.

    The author declares no conflict of interest in this paper.



    [1] Helander E, Virtanen T, Nurminen J, Gabbouj M (2010) Voice conversion using partial least squares regression. IEEE/ACM Transactions on Audio, Speech and Language Processing 18: 912–921. https://doi.org/10.1109/TASL.2011.2165944
    [2] Saito Y, Takamichi S, Saruwatari H (2017) Voice conversion using input-to-output highway networks. IEICE T Inf Syst 100: 1925–1928. https://doi.org/10.1587/transinf.2017EDL8034
    [3] Yeh CC, Hsu PC, Chou JC, Lee HY, Lee LS (2018) Rhythm Flexible Voice Conversion Without Parallel Data Using Cycle-GAN Over Phoneme Posteriorgram Sequences. IEEE Spoken Language Technology Workshop (SLT) 274–281. https://doi.org/10.1109/SLT.2018.8639647
    [4] Sun L, Wang H, Kang S, Li K, Meng HM (2016) Personalized Cross-Lingual TTS Using Phonetic Posteriorgrams. Interspeech 322–326. https://doi.org/10.21437/Interspeech.2016-1043
    [5] Tian X, Chng ES, Li H (2019) A Speaker-Dependent WaveNet for Voice Conversion with Non-Parallel Data. Interspeech 201–205. https://doi.org/10.21437/Interspeech.2019-1514
    [6] Takahashi N, Singh MK, Mitsufuji Y (2023) Robust One-Shot Singing Voice Conversion. arXiv: 2210.11096v2. https://doi.org/10.48550/arXiv.2210.11096
    [7] Hono Y, Hashimoto K, Oura K, Nankaku Y, Tokuda K (2019) Singing Voice Synthesis Based on Generative Adversarial Networks. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 6955–6959. https://doi.org/10.1109/ICASSP.2019.8683154
    [8] Sun L, Kang S, Li K, Meng H (2015) Voice conversion using deep bidirectional long short-term memory based recurrent neural networks. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 4869–4873. https://doi.org/10.1109/ICASSP.2015.7178896
    [9] Kaneko T, Kameoka H, Hiramatsu K, Kashino K (2017) Sequence-to-Sequence Voice Conversion with Similarity Metric Learned Using Generative Adversarial Networks. Interspeech 2017: 1283–1287. https://doi.org/10.21437/Interspeech.2017-970
    [10] Freixes M, Alías F, Carrie JC (2019) A unit selection text-to-speech-and-singing synthesis framework from neutral speech: proof of concept. EURASIP Journal on Audio, Speech, and Music Processing 2019: 1–14. https://doi.org/10.1186/s13636-019-0163-y
    [11] Hono Y, Hashimoto K, Oura K, Nankaku Y, Tokuda K (2021) Sinsy: a deep neural network-based singing voice synthesis system. IEEE/ACM T Audio Spe 29: 2803–2815. https://doi.org/10.1109/TASLP.2021.3104165
    [12] Sisman B, Vijayan K, Dong M, Li H (2019) SINGAN: Singing Voice Conversion with Generative Adversarial Networks. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) 112–118. https://doi.org/10.1109/APSIPAASC47483.2019.9023162
    [13] Sisman B, Li H (2020) Generative adversarial networks for singing voice conversion with and without parallel data. Odyssey 238–244. https://doi.org/10.21437/Odyssey.2020-34
    [14] Zhao W, Wang W, Sun Y, Tang T (2019) Singing voice conversion based on WD-GAN algorithm. IEEE 4th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC) 950–954. https://doi.org/10.1109/IAEAC47372.2019.8997824
    [15] Fang F, Yamagishi J, Echizen I, Lorenzo-Trueba J (2018) High-Quality Nonparallel Voice Conversion Based on Cycle-Consistent Adversarial Network. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 5279–5283. https://doi.org/10.1109/ICASSP.2018.8462342
    [16] Kameoka H, Kaneko T, Tanaka K, Hojo N (2018) StarGAN-VC: Non-parallel many-to-many Voice Conversion Using Star Generative Adversarial Networks. IEEE Spoken Language Technology Workshop (SLT) 266–273. https://doi.org/10.1109/SLT.2018.8639535
    [17] Chen Y, Xia R, Yang K, Zou K (2023) MICU: Image Super-resolution via Multi-level Information Compensation and U-net. Expert Syst Appl 245: 123111. https://doi.org/10.1016/j.eswa.2023.123111
    [18] Chen Y, Xia R, Yang K, Zou K (2023) MFMAM: Image Inpainting via Multi-Scale Feature Module with Attention Module. Comput Vis Image Und 238: 103883. https://doi.org/10.1016/j.cviu.2023.103883
    [19] Chen Y, Xia R, Yang K, Zou K (2023) GCAM: Lightweight Image Inpainting via Group Convolution and Attention Mechanism. Int J Mach Learn Cyb 15: 1815–1825. https://doi.org/10.1007/s13042-023-01999-z
    [20] Chen Y, Xia R, Yang K, Zou K (2024) DNNAM: Image Inpainting Algorithm via Deep Neural Networks and Attention Mechanism. Appl Soft Comput 111392. https://doi.org/10.1016/j.asoc.2024.111392
    [21] Chen Y, Xia R, Yang K, Zou K (2023) DARGS: Image Inpainting Algorithm via Deep Attention Residuals Group and Semantics. J King Saud Univ-Comput 35: 101567. https://doi.org/10.1016/j.jksuci.2023.101567
    [22] Chen L, Zhang X, Li Y, Sun M, Chen W (2024) A Noise-Robust Voice Conversion Method with Controllable Background Sounds. Complex Intell Syst 1–14. https://doi.org/10.1007/s40747-024-01375-6
    [23] Walczyna T, Piotrowski Z (2023) Overview of Voice Conversion Methods Based on Deep Learning. Applied Sciences 13: 3100. https://doi.org/10.3390/app13053100
    [24] Liu EM, Yeh JW, Lu JH, Liu YW (2023) Speaker Embedding Space Cosine Similarity Comparisons of Singing Voice Conversion. The Journal of the Acoustical Society of America (JASA) 154: A244. https://doi.org/10.1121/10.0023424
    [25] Hsu CC, Hwang HT, Wu YC, Tsao Y, Wang HM (2016) Voice conversion from non-parallel corpora using variational auto-encoder. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA) 1–6. https://doi.org/10.1109/APSIPA.2016.7820786
    [26] Tobing PL, Wu YC, Hayashi T, Kobayashi K, Toda T (2019) Non-Parallel Voice Conversion with Cyclic Variational Autoencoder. Interspeech 674–678. https://doi.org/10.21437/Interspeech.2019-2307
    [27] Yook D, Leem SG, Lee K, Yoo IC (2020) Many-to-many voice conversion using cycle-consistent variational autoencoder with multiple decoders. Odyssey 215–221. https://doi.org/10.21437/Odyssey.2020-31
    [28] Hsu CC, Hwang HT, Wu YC, Tsao Y, Wang HM (2017) Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks. arXiv preprint arXiv: 1704.00849. https://doi.org/10.48550/arXiv.1704.00849
    [29] Huang WC, Violeta LP, Liu S, Shi J, Toda T (2023) The Singing Voice Conversion Challenge 2023. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 1–8. https://doi.org/10.1109/ASRU57964.2023.10389671
    [30] Chen Q, Tan M, Qi Y, Zhou J, Li Y, Wu Q (2022) V2C: Visual Voice Cloning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 21242–21251.
    [31] Qian K, Zhang Y, Chang S, Yang X, Hasegawa-Johnson M (2019) AutoVC: Zero-shot voice style transfer with only autoencoder loss. International Conference on Machine Learning 5210–5219.
    [32] Patel M, Purohit M, Parmar M, Shah NJ, Patil HA (2020) AdaGAN: Adaptive GAN for many-to-many non-parallel voice conversion.
    [33] Liu F, Wang H, Peng R, Zheng C, Li X (2021) U2-VC: one-shot voice conversion using two-level nested U-structure. EURASIP Journal on Audio, Speech, and Music Processing 2021: 1–15. https://doi.org/10.1186/s13636-021-00226-3
    [34] Liu F, Wang H, Ke Y, Zheng C (2022) One-shot voice conversion using a combination of U2-Net and vector quantization. Appl Acoust 199: 109014. https://doi.org/10.1016/j.apacoust.2022.109014
    [35] Wu DY, Lee HY (2020) One-shot voice conversion by vector quantization. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 7734–7738. https://doi.org/10.1109/ICASSP40776.2020.9053854
    [36] Chou JC, Lee HY (2019) One-Shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization. Interspeech 664–668. https://doi.org/10.21437/Interspeech.2019-2663
    [37] Huang X, Belongie S (2017) Arbitrary style transfer in real-time with adaptive instance normalization. IEEE International Conference on Computer Vision (ICCV) 1501–1510. https://doi.org/10.1109/ICCV.2017.167
    [38] Lian J, Lin P, Dai Y, Li G (2022) Arbitrary Voice Conversion via Adversarial Learning and Cycle Consistency Loss. International Conference on Intelligent Computing 569–578. https://doi.org/10.1007/978-3-031-13829-4_49
    [39] Gu Y, Zhao X, Yi X, Xiao J (2022) Voice Conversion Using Learnable Similarity-Guided Masked Autoencoder. International Workshop on Digital Watermarking 13825: 53–67. https://doi.org/10.1007/978-3-031-25115-3_4
    [40] Chen YH, Wu DY, Wu TH, Lee HY (2021) AGAIN-VC: A one-shot voice conversion using activation guidance and adaptive instance normalization. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 5954–5958. https://doi.org/10.1109/ICASSP39728.2021.9414257
    [41] Ulyanov D, Lebedev V, Vedaldi A, Lempitsky VS (2016) Texture networks: Feed-forward synthesis of textures and stylized images. Proceedings of the 33rd International Conference on Machine Learning 1349–1357.
    [42] Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the 32nd International Conference on Machine Learning 37: 448–456.
    [43] Li Y, Wang N, Shi J, Liu J, Hou X (2016) Revisiting batch normalization for practical domain adaptation. arXiv preprint arXiv: 1603.04779.
    [44] Ulyanov D, Vedaldi A, Lempitsky V (2017) Improved Texture Networks: Maximizing Quality and Diversity in Feed-Forward Stylization and Texture Synthesis. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 4105–4113. https://doi.org/10.1109/CVPR.2017.437
    [45] Liu J, Han W, Ruan H, Chen X, Jiang D, Li H (2018) Learning Salient Features for Speech Emotion Recognition Using CNN. First Asian Conference on Affective Computing and Intelligent Interaction (ACII Asia) 1–5. https://doi.org/10.1109/ACIIAsia.2018.8470393
    [46] Lim W, Jang D, Lee T (2016) Speech emotion recognition using convolutional and recurrent neural networks. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA) 1–4. https://doi.org/10.1109/APSIPA.2016.7820699
    [47] Hajarolasvadi N, Demirel H (2019) 3D CNN-Based Speech Emotion Recognition Using K-Means Clustering and Spectrograms. Entropy 21: 479. https://doi.org/10.3390/e21050479
    [48] Graves A (2012) Long Short-Term Memory. Supervised Sequence Labelling with Recurrent Neural Networks, Studies in Computational Intelligence 385: 37–45. https://doi.org/10.1007/978-3-642-24797-2
    [49] Chung J, Gulcehre C, Cho K, Bengio Y (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv: 1412.3555.
    [50] Kumar K, Kumar R, de Boissiere T, Gestin L, Teoh WZ, Sotelo J, et al. (2019) MelGAN: Generative adversarial networks for conditional waveform synthesis. Advances in Neural Information Processing Systems 14910–14921.
    [51] Kong J, Kim J, Bae J (2020) HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. Proceedings of the 34th International Conference on Neural Information Processing Systems 33: 17022–17033.
    [52] Duan Z, Fang H, Li B, Sim KC, Wang Y (2013) The NUS sung and spoken lyrics corpus: A quantitative comparison of singing and speech. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference 1–9. https://doi.org/10.1109/APSIPA.2013.6694316
    [53] Kubichek R (1993) Mel-cepstral distance measure for objective speech quality assessment. Proceedings of IEEE Pacific Rim Conference on Communications, Computers and Signal Processing 1: 125–128. https://doi.org/10.1109/PACRIM.1993.407206
    [54] Kobayashi K, Toda T, Nakamura S (2018) Intra-gender statistical singing voice conversion with direct waveform modification using log spectral differential. Speech Commun 99: 211–220. https://doi.org/10.1016/j.specom.2018.03.011
    [55] Toda T, Tokuda K (2007) A speech parameter generation algorithm considering global variance for HMM-based speech synthesis. IEICE T Inf Syst 90: 816–824. https://doi.org/10.1093/ietisy/e90-d.5.816
  • This article has been cited by:

    1. Dmitry I. Sinelshchikov, Linearizabiliy and Lax representations for cubic autonomous and non-autonomous nonlinear oscillators, 2023, 01672789, 133721, 10.1016/j.physd.2023.133721
    2. Jaume Giné, Xavier Santallusia, Integrability via algebraic changes of variables, 2024, 184, 09600779, 115026, 10.1016/j.chaos.2024.115026
    3. Meryem Belattar, Rachid Cheurfa, Ahmed Bendjeddou, Paulo Santana, A class of nonlinear oscillators with non-autonomous first integrals and algebraic limit cycles, 2023, 14173875, 1, 10.14232/ejqtde.2023.1.50
    4. Jaume Giné, Dmitry Sinelshchikov, Integrability of Oscillators and Transcendental Invariant Curves, 2025, 24, 1575-5460, 10.1007/s12346-024-01182-x
  • © 2024 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
