The complex cepstrum vocoder is used to modify the speaker specific characteristics of the source speaker's speech to those of the target speaker. Low-time and high-time liftering are used to split the computed cepstrum into vocal tract and source excitation parameters. The resulting mixed-phase vocal tract and source excitation parameters, obtained as finite impulse responses, preserve the phase properties of the resynthesized speech frame. A radial basis function (RBF) network is explored to capture the nonlinear mapping function for modifying the real and imaginary components of the complex cepstrum based vocal tract and source excitation of the speech signal. The state-of-the-art Mel cepstrum envelope and the fundamental frequency (F0) are considered to represent the vocal tract and the source excitation of the speech frame, respectively. A radial basis function network is used to capture and formulate the nonlinear relations between the Mel cepstrum envelopes of the source and target speakers. A mean and standard deviation based approach is employed to modify the fundamental frequency (F0). The Mel log spectral approximation (MLSA) filter is used to reconstruct the speech signal from the modified Mel cepstrum envelope and fundamental frequency. The proposed complex cepstrum based model is compared with the state-of-the-art Mel cepstrum envelope based voice conversion model using objective and subjective evaluations. The evaluation measures reveal that the proposed complex cepstrum based voice conversion system approximates the converted speech signal with better accuracy than the Mel cepstrum envelope based voice conversion model.

1. Introduction

A voice conversion (VC) system extracts the features of the source and target speakers' speech and formulates a mapping function that modifies the features of the source speaker's speech so that the resynthesized speech sounds as if spoken by the target speaker [1]. Applications of VC include the personification of text-to-speech, the design of multispeaker speech synthesis systems, audio dubbing, karaoke applications, security related systems, the design of speaking aids for speech impaired patients, broadcasting, and multimedia applications [2–4]. VC involves the transformation of speaker specific characteristics such as vocal tract parameters, source excitation, and long term prosodic parameters into those of the desired speaker [5]. The vocal tract parameters are relatively more prominent for identifying speaker individuality than the source excitation [5]. Several methods have been
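As an illustrative sketch of the homomorphic analysis step described above (not the authors' implementation), the following Python fragment computes the complex cepstrum of a windowed speech frame and splits it into a low-quefrency part (vocal tract) and a high-quefrency part (source excitation) by liftering. The FFT size, lifter cutoff, and function names are assumptions chosen for illustration.

```python
import numpy as np

def complex_cepstrum(frame, n_fft=1024):
    """Complex cepstrum of one windowed speech frame (homomorphic analysis).

    The phase is unwrapped and the linear-phase (delay) term is removed so
    that the complex logarithm of the spectrum is smooth and its inverse FFT
    is (close to) real valued.
    """
    spectrum = np.fft.fft(frame, n_fft)
    phase = np.unwrap(np.angle(spectrum))
    center = n_fft // 2
    ndelay = int(np.round(phase[center] / np.pi))
    phase = phase - np.pi * ndelay * np.arange(n_fft) / center
    log_spectrum = np.log(np.abs(spectrum) + 1e-12) + 1j * phase
    return np.fft.ifft(log_spectrum).real

def lifter_split(ceps, cutoff=30):
    """Low-time / high-time liftering of the complex cepstrum.

    Quefrencies |q| < cutoff on both the causal and anti-causal sides (so the
    mixed-phase character is preserved) approximate the vocal tract part; the
    remainder approximates the source excitation part.
    """
    low = np.zeros_like(ceps)
    low[:cutoff] = ceps[:cutoff]                # causal low-quefrency part
    low[-(cutoff - 1):] = ceps[-(cutoff - 1):]  # anti-causal low-quefrency part
    high = ceps - low
    return low, high
```

In practice the frame would be pitch-synchronously windowed and the lifter cutoff chosen below the first rahmonic (pitch peak) of the cepstrum.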
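The radial basis function mapping referred to above can be sketched, in generic form, as Gaussian kernels centred on k-means centres with linear output weights obtained by least squares. The class name, number of centres, and kernel-width heuristic below are assumptions; the sketch only assumes that time-aligned source and target feature vectors are available.

```python
import numpy as np
from sklearn.cluster import KMeans

class RBFMapper:
    """Generic RBF network: Gaussian kernels on k-means centres,
    linear output weights solved by least squares."""

    def __init__(self, n_centers=64, width=None):
        self.n_centers = n_centers
        self.width = width

    def _design(self, X):
        # squared Euclidean distance of every frame to every centre
        d2 = ((X[:, None, :] - self.centers[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-d2 / (2.0 * self.width ** 2))

    def fit(self, X_src, Y_tgt):
        self.centers = KMeans(n_clusters=self.n_centers,
                              n_init=10).fit(X_src).cluster_centers_
        if self.width is None:
            # heuristic kernel width: mean distance between distinct centres
            d = np.sqrt(((self.centers[:, None] -
                          self.centers[None, :]) ** 2).sum(axis=-1))
            self.width = d[d > 0].mean()
        phi = self._design(X_src)
        self.weights, *_ = np.linalg.lstsq(phi, Y_tgt, rcond=None)
        return self

    def predict(self, X_src):
        return self._design(X_src) @ self.weights

# Usage: X_src and Y_tgt are time-aligned source/target feature matrices
# (frames x dimensions), e.g. aligned by dynamic time warping.
# mapper = RBFMapper(n_centers=64).fit(X_src, Y_tgt)
# Y_converted = mapper.predict(X_src_test)
```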
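The mean and standard deviation based F0 modification mentioned above is commonly realized in the log-F0 domain; whether the statistics are applied in the log or linear domain is not stated in this section, so the log-domain form below is an assumption.

```python
import numpy as np

def convert_f0(f0_src, src_mean, src_std, tgt_mean, tgt_std):
    """Mean/standard-deviation based F0 transformation in the log-F0 domain.

    src_mean/src_std and tgt_mean/tgt_std are the mean and standard deviation
    of log F0 over the voiced frames of the source and target training data;
    unvoiced frames (F0 == 0) are passed through unchanged.
    """
    f0_conv = np.zeros_like(f0_src, dtype=float)
    voiced = f0_src > 0
    log_f0 = np.log(f0_src[voiced])
    f0_conv[voiced] = np.exp(tgt_mean + (log_f0 - src_mean) * tgt_std / src_std)
    return f0_conv
```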
References
[1]
H. Kuwabara and Y. Sagisaka, “Acoustic characteristics of speaker individuality: control and conversion,” Speech Communication, vol. 16, no. 2, pp. 165–173, 1995.
[2]
K.-S. Lee, “Statistical approach for voice personality transformation,” IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 2, pp. 641–651, 2007.
[3]
A. Kain and M. W. Macon, “Design and evaluation of a voice conversion algorithm based on spectral envelope mapping and residual prediction,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 813–816, May 2001.
[4]
D. Sundermann, “Voice conversion: state-of-the-art and future work,” in Proceedings of the 31st German Annual Conference on Acoustics (DAGA '01), Munich, Germany, 2001.
[5]
D. G. Childers, B. Yegnanarayana, and W. Ke, “Voice conversion: factors responsible for quality,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '85), vol. 1, pp. 748–751, Tampa, Fla, USA, 1985.
[6]
M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, “Voice conversion through vector quantization,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 655–658, 1988.
[7]
W. Verhelst and J. Mertens, “Voice conversion using partitions of spectral feature space,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '96), pp. 365–368, May 1996.
[8]
N. Iwahashi and Y. Sagisaka, “Speech spectrum conversion based on speaker interpolation and multi-functional representation with weighting by radial basis function networks,” Speech Communication, vol. 16, no. 2, pp. 139–151, 1995.
[9]
Y. Stylianou, O. Cappé, and E. Moulines, “Continuous probabilistic transform for voice conversion,” IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pp. 131–142, 1998.
[10]
S. Desai, A. W. Black, B. Yegnanarayana, and K. Prahallad, “Spectral mapping using artificial neural networks for voice conversion,” IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 5, pp. 954–964, 2010.
[11]
C. Orphanidou, I. M. Moroz, and S. J. Roberts, “Wavelet-based voice morphing,” WSEAS Journal of Systems, vol. 10, no. 3, pp. 3297–3302, 2004.
[12]
E. Helander, T. Virtanen, J. Nurminen, and M. Gabbouj, “Voice conversion using partial least squares regression,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 912–921, 2010.
[13]
L. M. Arslan, “Speaker transformation algorithm using segmental codebooks (STASC),” Speech Communication, vol. 28, no. 3, pp. 211–226, 1999.
[14]
K. S. Rao, “Voice conversion by mapping the speaker-specific features using pitch synchronous approach,” Computer Speech and Language, vol. 24, no. 3, pp. 474–494, 2010.
[15]
S. Hayakawa and F. Itakura, “Text-dependent speaker recognition using the information in the higher frequency band,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '94), pp. 137–140, Adelaide, Australia, 1994.
[16]
H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, “Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds,” Speech Communication, vol. 27, no. 3, pp. 187–207, 1999.
[17]
R. J. McAulay and T. F. Quatieri, “Phase modelling and its application to sinusoidal transform coding,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '86), vol. 1, pp. 1713–1716, Tokyo, Japan, 1986.
[18]
J. Nirmal, P. Kachare, S. Patnaik, and M. Zaveri, “Cepstrum liftering based voice conversion using RBF and GMM,” in Proceedings of the IEEE International Conference on Communications and Signal Processing (ICCSP '13), pp. 570–575, April 2013.
[19]
A. V. Oppenheim, “Speech analysis and synthesis system based on homomorphic filtering,” The Journal of the Acoustical Society of America, vol. 45, no. 2, pp. 458–465, 1969.
[20]
W. Verhelst and O. Steenhaut, “A new model for the short-time complex cepstrum of voiced speech,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, no. 1, pp. 43–51, 1986.
[21]
H. Deng, R. K. Ward, M. P. Beddoes, and M. Hodgson, “A new method for obtaining accurate estimates of vocal-tract filters and glottal waves from vowel sounds,” IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 2, pp. 445–455, 2006.
[22]
T. Drugman, B. Bozkurt, and T. Dutoit, “Complex cepstrum-based decomposition of speech for glottal source estimation,” in Proceedings of the 10th Annual Conference of the International Speech Communication Association (INTERSPEECH '09), pp. 116–119, Brighton, UK, September 2009.
[23]
T. F. Quatieri Jr., “Minimum and mixed phase speech analysis-synthesis by adaptive homomorphic deconvolution,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, no. 4, pp. 328–335, 1979.
[24]
T. Drugman, B. Bozkurt, and T. Dutoit, “Causal-anticausal decomposition of speech using complex cepstrum for glottal source estimation,” Speech Communication, vol. 53, no. 6, pp. 855–866, 2011.
[25]
M. Vondra and R. Vích, “Speech modeling using the complex cepstrum,” in Toward Autonomous, Adaptive, and Context-Aware Multimodal Interfaces: Theoretical and Practical Issues, vol. 6456 of Lecture Notes in Computer Science, pp. 324–330, 2011.
[26]
R. Maia, M. Akamine, and M. Gales, “Complex cepstrum as phase information in statistical parametric speech synthesis,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '12), pp. 4581–4584, 2012.
[27]
K. Shikano, S. Nakamura, and M. Abe, “Speaker adaptation and voice conversion by codebook mapping,” in Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 594–597, June 1991.
[28]
T. Toda, H. Saruwatari, and K. Shikano, “Voice conversion algorithm based on Gaussian mixture model with dynamic frequency warping of straight spectrum,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 841–844, May 2001.
[29]
H. Ye and S. Young, “High quality voice morphing,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. I9–I12, May 2004.
[30]
R. Laskar, K. Banerjee, F. Talukdar, and K. Sreenivasa Rao, “A pitch synchronous approach to design voice conversion system using source-filter correlation,” International Journal of Speech Technology, vol. 15, pp. 419–431, 2012.
[31]
D. Sündermann, A. Bonafonte, H. Ney, and H. Höge, “A study on residual prediction techniques for voice conversion,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), pp. I13–I16, March 2005.
[32]
K. S. Rao, R. H. Laskar, and S. G. Koolagudi, “Voice transformation by mapping the features at syllable level,” in Pattern Recognition and Machine Intelligence, vol. 4815 of Lecture Notes in Computer Science, pp. 479–486, Springer, 2007.
[33]
Speech Signal Processing Toolkit (SPTK), http://sp-tk.sourceforge.net/.
[34]
S. Imai, “Cepstral analysis and synthesis on the mel-frequency scale,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '83), pp. 93–96, Boston, Mass, USA, 1983.
[35]
K. Tokuda, T. Kobayashi, T. Masuko, and S. Imai, “Mel-generalized cepstral analysis—a unified approach to speech spectral estimation,” in Proceedings of the International Conference on Spoken Language Processing (ICSLP '94), pp. 1043–1046, 1994.
[36]
J. Kominek and A. W. Black, “CMU ARCTIC speech databases,” in Proceedings of the 5th ISCA Speech Synthesis Workshop, pp. 223–224, Pittsburgh, Pa, USA, June 2004.