| A generalized smoothness criterion for acoustic-to-articulatory inversion
Prasanta Kumar Ghosh and Shrikanth Narayanan
J. Acoust. Soc. Am. Express Letters, Volume 128, Issue 4, September 2009, pp. 2162-2172 |
Abstract: The many-to-one mapping from representations in the speech articulatory space to acoustic space renders the associated acoustic-to-articulatory inverse mapping non-unique. Among various techniques, imposing smoothness constraints on the articulator trajectories is one of the common approaches to handling this non-uniqueness, because articulators typically move smoothly during speech production. A standard smoothness constraint is to minimize the energy of the difference of the articulatory position sequence so that the articulator trajectory is smooth and low-pass in nature. Such a fixed definition of smoothness is not always realistic or adequate, because different articulators have different degrees of smoothness. In this paper, an optimization formulation is proposed for the inversion problem, which includes a generalized smoothness criterion. Under such generalized smoothness settings, the smoothness parameter can be chosen for each specific articulator in a data-driven fashion. In addition, this formulation allows estimation of articulatory positions recursively over time without any loss in performance. Experiments with the MOCHA TIMIT database show that the estimated articulator trajectories obtained using such a generalized smoothness criterion have lower RMS error and higher correlation with the actual measured trajectories compared to those obtained using a fixed smoothness constraint. |
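The standard constraint mentioned in the abstract (minimizing the energy of the first difference of the position sequence) can be illustrated with a minimal sketch. This is not the paper's generalized criterion or its inversion algorithm: the quadratic data term, the function names `smooth_trajectory` and `diff_energy`, the weight `lam`, and the per-articulator weights in the usage lines are all assumptions made for illustration. The sketch finds the trajectory x that minimizes sum (x[n] - y[n])^2 + lam * sum (x[n] - x[n-1])^2 for a rough initial estimate y, which reduces to a tridiagonal linear system.

```python
def smooth_trajectory(y, lam):
    """Return x minimizing sum((x[n]-y[n])^2) + lam * sum((x[n]-x[n-1])^2).

    y   : rough per-frame position estimates for one articulator (assumed input)
    lam : non-negative smoothness weight (hypothetical parameter name)
    """
    n = len(y)
    if n < 2 or lam == 0:
        return [float(v) for v in y]

    # Normal equations (I + lam * D^T D) x = y, with D the first-difference
    # operator. The matrix is tridiagonal: off-diagonals are -lam.
    diag = [1.0 + 2.0 * lam] * n
    diag[0] = diag[-1] = 1.0 + lam
    off = -lam

    # Thomas algorithm, forward elimination.
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = off / diag[0]
    d[0] = y[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - off * c[i - 1]
        c[i] = off / denom
        d[i] = (y[i] - off * d[i - 1]) / denom

    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def diff_energy(x):
    """Energy of the first difference, the quantity the constraint penalizes."""
    return sum((b - a) ** 2 for a, b in zip(x, x[1:]))


# Usage: a jittery estimate smoothed with two illustrative weights, standing in
# for a slower and a faster articulator (labels and values are made up).
rough = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
weights = {"slow_articulator": 10.0, "fast_articulator": 0.5}
smoothed = {name: smooth_trajectory(rough, lam) for name, lam in weights.items()}
```

A single fixed `lam` applies the same degree of low-pass behaviour to every articulator, which is the limitation the abstract points out. Using a different weight per articulator, as in the usage lines, only hints at the idea of articulator-specific smoothness; how the paper defines and selects its smoothness parameter is not reproduced here.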