Processing speech signal using auditory-like filterbank provides least uncertainty about articulatory gestures


Prasanta Kumar Ghosh, Louis M. Goldstein and Shrikanth Narayanan

J. Acoust. Soc. Am., Volume 129, Issue 6, June 2011, pp. 4014–4022

Abstract: Understanding how the human speech production system is related to the human auditory system has been a perennial subject of inquiry. To investigate the production–perception link, in this paper, a computational analysis has been performed using the articulatory movement data obtained during speech production with concurrently recorded acoustic speech signals from multiple subjects in three different languages: English, Cantonese, and Georgian. The form of articulatory gestures during speech production varies across languages, and this variation is considered to be reflected in the articulatory position and kinematics. The auditory processing of the acoustic speech signal is modeled by a parametric representation of the cochlear filterbank which allows for realizing various candidate filterbank structures by changing the parameter value. Using mathematical communication theory, it is found that the uncertainty about the articulatory gestures in each language is maximally reduced when the acoustic speech signal is represented using the output of a filterbank similar to the empirically established cochlear filterbank in the human auditory system. Possible interpretations of this finding are discussed.
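One standard way to quantify a "reduction in uncertainty" in communication theory is the mutual information I(X;Y) = H(X) − H(X|Y), the drop in entropy of X once Y is observed. The sketch below is a minimal plug-in estimate for paired samples that have already been discretized. It is an illustration only, not the estimator used in the paper; the function name `mutual_information` and the toy label sequences are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2(p(x,y) / (p(x) p(y))), written with raw counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi


# Toy data (hypothetical): quantized gesture labels and two candidate
# quantized acoustic representations of the same 16 frames.
gestures = [0, 1, 2, 3] * 4
informative = gestures[:]                     # determines the gesture label exactly
uninformative = [i // 4 for i in range(16)]   # statistically independent of it

print(mutual_information(gestures, informative))    # 2.0 bits
print(mutual_information(gestures, uninformative))  # 0.0 bits
```

Under this view, comparing candidate filterbanks amounts to computing such a score for each candidate's output and selecting the one with the largest value. Real articulatory and acoustic features are continuous and multidimensional, so a practical estimate would need a quantization or density-estimation step that this sketch leaves out.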

Copyright (2011) Acoustical Society of America. This article may be downloaded for personal use only. Any other use requires prior permission of the author and the Acoustical Society of America.


References:

[1] C. P. Browman and L. Goldstein, “Articulatory gestures as phonological units,” Phonology 6(2), 201–251 (1989).
[2] S. S. Stevens and F. Warshofsky, Sound and hearing, rev. ed. (New York, Time, Inc., 1980), p. 58.
[3] E. C. Smith and M. S. Lewicki, “Efficient auditory coding,” Nature 439, 978–982 (2006).
[4] M. S. Lewicki, “Efficient coding of natural sounds,” Nature Neuroscience 5(4), 356–363 (2002).
[5] P. Y. Oudeyer, Self-organization in the evolution of speech (Oxford University Press, USA, 2006), pp. 32–51.
[6] B. Lindblom, “Role of articulation in speech perception: Clues from production,” Journal of the Acoustical Society of America 99, 1683–1692 (1996).
[7] P. K. Kuhl, “Discrimination of speech by nonhuman animals: Basic auditory sensitivities conducive to the perception of speech-sound categories,” Journal of the Acoustical Society of America 70(2), 340–349 (1981).
[8] S. M. Wilson, A. P. Saygin, M. I. Sereno, and M. Iacoboni, “Listening to speech activates motor areas involved in speech production,” Nature Neuroscience 7, 701–702 (2004).
[9] A. M. Liberman, F. S. Cooper, D. P. Shankweiler, and M. Studdert-Kennedy, “Perception of the speech code,” Psychol. Rev. 74, 431–461 (1967).
[10] A. M. Liberman and I. G. Mattingly, “The motor theory of speech perception revised,” Cognition 21, 1–36 (1985).
[11] C. A. Fowler, “An event approach to the study of speech perception from a direct-realist perspective,” J. Phonetics 14, 3–28 (1986).
[12] T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley Interscience, New York, 1991), pp. 12–49.
[13] J. R. Westbury, “X-ray microbeam speech production database user’s handbook version 1.0,” http://www2.uni-jena.de/~x1siad/uwxrmbdb.html (date last viewed 6/15/2010) (1994).
[14] M. Yanagawa, Articulatory timing in first and second language: a cross-linguistic study (Doctoral dissertation, Yale University, 2006).
[15] L. Goldstein, I. Chitoran, and E. Selkirk, “Syllable structure as coupled oscillator modes: Evidence from Georgian vs. Tashlhiyt Berber,” Proceedings of the 16th International Congress of Phonetic Sciences, Saarbrücken, Germany, 241–244 (2007).
[16] J. S. Perkell, M. Cohen, M. Svirsky, M. Matthies, I. Garabieta, and M. Jackson, “Electromagnetic midsagittal articulometer systems for transducing speech articulatory movements,” Journal of the Acoustical Society of America 92, 3078–3096 (1992).
[17] E. L. Saltzman and K. G. Munhall, “A dynamical approach to gestural patterning in speech production,” Ecological Psychology 1, 333–382 (1989).
[18] C. P. Browman and L. Goldstein, “Gestural specification using dynamically-defined articulatory structures,” Journal of Phonetics 18, 299–320 (1990).
[19] K. Johnson, Acoustic and auditory phonetics, 2nd ed. (Wiley-Blackwell, 2003), pp. 51–52.
[20] E. Zwicker and E. Terhardt, “Analytical expressions for critical-band rate and critical bandwidth as a function of frequency,” Journal of the Acoustical Society of America 68, 1523–1525 (1980).
[21] A. Makur and S. K. Mitra, “Warped discrete-Fourier transform: Theory and applications,” IEEE Transactions on Circuits and Systems-I: Fundamental Theory and Applications 48, 1086–1093 (2001).
[22] J. O. Smith and J. S. Abel, “Bark and ERB bilinear transforms,” IEEE Transactions on Speech and Audio Processing 7, 697–708 (1999).
[23] P. K. Ghosh and S. S. Narayanan, “Bark Frequency Transform Using an Arbitrary Order Allpass Filter,” IEEE Signal Processing Letters 17(6), 543–546 (2010).
[24] S. W. K. Fu, C. H. Lee, and O. L. Clubb, “A survey on Chinese speech recognition,” Communications of COLIPS 6(1), 1–17 (1996).
[25] G. M. White and R. B. Neely, “Speech recognition experiments with linear prediction, bandpass filtering, and dynamic programming,” IEEE Trans. Acoust., Speech, Signal Processing ASSP-24, 183–188 (1976).
[26] P. Cosi, Y. Bengio, and R. De Mori, “Phonetically-based multi-layered neural networks for vowel classification,” Speech Communication 9(1), 15–29 (1990).
[27] J. Picone, “Signal modeling techniques in speech recognition,” Proceedings of the IEEE 81, 1215–1247 (1993).
[28] A. Forge and T. Wright, “The molecular architecture of the inner ear,” British Medical Bulletin 63(1), 5–24 (2002).
[29] K. K. Glendenning and R. B. Masterton, “Comparative Morphometry of Mammalian Central Auditory Systems: Variation in Nuclei and Form of the Ascending System,” Brain Behavior and Evolution 51(2), 54–89 (1998).
[30] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis (New York: Wiley Interscience, 2000), pp. 85–92.
[31] G. A. Darbellay and I. Vajda, “Estimation of the information by an adaptive partition of the observation space,” IEEE Transactions on Information Theory 45, 1315–1321 (1999).








